A Brief History of the Tatoeba Project (Petit historique)

Disclaimer

Timeline

2001 (and before):
The Tanaka Corpus (which eventually was imported into the Tatoeba Project) was being compiled by Professor Yasuhito Tanaka and his students at Hyogo University
2003:
Jim Breen integrates the Tanaka Corpus into his WWWJDIC server
Screenshot of an Early Version

2006-08-15:
Trang's dictionary project
This is a multilingual dictionary. A Wikipedia type of thing, except that people add sentences, not articles. The aim: to build a large database of sentences translated into various languages that everyone can access for free.
http://sourceforge.net/projects/multilangdict/
2006-08-18:
First News: Help Wanted http://sourceforge.net/news/?group_id=175276
2007-05-13:
This "dictionary" is soon going to be hosted on my university server.
2008-01-20:
www.manythings.org/corpus started using the Tanaka Corpus that was being maintained by Jim Breen and Paul Blay.
2008-01:
The domain name http://tatoeba.fr was registered.
2008-06:
The domain name http://tatoeba.org was registered.
2008-10-20:
Date of the last public domain version of the Tanaka Corpus. ftp.monash.edu.au/pub/nihongo/examples_pd.gz
Tatoeba Project Sentence 237582 is perhaps the highest numbered Japanese sentence
Tatoeba Project Sentence 329712 is perhaps the highest numbered English sentence
Some of the very low numbered sentences, I think, were not part of the Tanaka Corpus.
2009-01-16:
Previous log data was lost with migration. All dates set to unknown. However, for a few thousand sentences author names were retrieved and added.
2009-01-16:
The forum wasn't migrated.
2009-01-22:
330,000 sentences - 150,000 in English, about the same in Japanese and 24,000 in French.
2009-01-22:
The original 24,000 French sentences came from Tokidoki (http://tokidoki.fr/ - currently offline) and were given to the Tatoeba Project.
2009-01-31:
The project moved to a new server and started using the domain name tatoeba.org.
2009-02-19:
Kakashi Japanese-to-Romaji converter re-implemented. However, it's far from perfect.
2009-04-04:
For the record, there are currently over 150,000 sentences in English and Japanese, and about 24,000 French translations for these 150,000 sentences.
2009-10-22: Blog Entry Date
Originally started with sentences from the Tanaka Corpus which had 212,000 sentence pairs. These were cleaned up quite a bit by Jim Breen and Paul Blay before being imported into the Tatoeba Project.
2009-10-22:
The French translations that were given to me were the result of the work of 80 volunteers.
2009-11-28:
Trang says: So the concept is: we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality.
2009-12-13:
Jim Breen starts using the Tatoeba Project to maintain the version of the Tanaka Corpus being used by WWWJDIC.
2009-12-13:
Trang announces the change from public domain to the Creative Commons Attribution license.
2010-02-13:
On February 13th we went to an event organized by an association based in Paris called Shtooka (http://shtooka.net/ - currently offline).
2010-02-23:
Trang blogs for the first time about the Tatoeba Project policies for contributions.
2010-03-06:
In addition to the Google auto-detect language, users can now choose the language for a submission.
2010-03-06:
Now users can adopt sentences in place instead of being redirected to an info page. (NOTE: The reason she uses the word "adopt," is that what is now called the "owner" of a sentence was originally called the "parent." Thus, also "orphan" sentence.)
2010-03-06:
Only the sentence being translated is now shown, to prevent new contributors from adding the translation to the wrong sentence.
2010-03-06:
Option to delete comments was introduced. (NOTE: You can't edit them, but you can delete and resubmit.)
2010-03-10:
Chinese sentences are now displayed both in traditional AND simplified.
2010-03-10:
It was announced that a pinyin converter and conversion between simplified/traditional Chinese were added.
2010-03-10:
Pagination of the Wall and latest messages shown on the main page.
2010-03-13:
Pages with just the comments on one user's sentences were added.
2010-03-13:
Pages with just one user's comments added to profiles.
2010-03-13:
Trang invites people to help translate the interface into Japanese, Spanish and German. Texts hosted on Launchpad: https://translations.launchpad.net/tatoeba
2010-04-01:
Audio is introduced with 900 Shanghainese audio files recorded by fucongcong. (The sentences came from shanghaining.com.)
2010-04-02:
Switched to MeCab for handling Japanese furigana and romaji. Romaji now only shows up on mouseover rather than being displayed on the page.
2010-04-02:
The link/unlink feature was added for any member who was what is now called an "advanced contributor." (Used to be "trusted user.")
2010-04-16:
Duplicate sentences to be merged and better looking private messages.
2010-04-16:
Trang announces the move to a new server, provided by the Free Software Foundation in France. (Used to be hosted by the webmaster of tokidoki.fr)
2010-04-18:
Switched search engines from Lucene to Sphinx
2010-04-30:
Trang blogs for the first time about how to improve the reliability of the sentences.
2010-04-30:
Trang talks about the idea of members "voting" on whether a sentence is accurate or not.
2010-04-30:
Trang talks about the possibility of "locking" a sentence once it's considered completely reliable, so even the owner can't change it.
2010-05-01:
Downloadable files are now updated every week.
2010-05-08:
You can also browse sentences that belong to a specific user, and you can filter them by language.
2010-05-08:
You can now add Tatoeba as a search engine in your little Firefox search bar.
2010-05-16:
Contributors can now edit and translate sentences directly from a list (as well as adopt, favorite and add to another list).
2010-05-16:
Indirect translations are taken into account in the search.
2010-05-16:
You can download a list into a file. (NOTE: In 2012, there is a limit on how long a list can be in order to use this function.)
2010-05-16:
You can specify the target language when searching sentences. That is to say, you can not only search "from", but also "to" a specific language.
2010-05-22:
Edit/Show pages for lists. Edit: for editing, translating sentences from the list. Show: simply for viewing the sentences and listening to them.
2010-05-22:
Pagination in lists (so that it won't take forever to load long lists).
2010-05-22:
Possibility to specify language of next and previous links in "Browse" section.
2010-05-24:
Trang discusses the role of "moderators" (now called "corpus maintainers").
2010-05-30:
We added support for right-to-left languages (like Arabic). They are now actually displayed right to left.
2010-05-30:
We simplified the registration process.
2010-06-12:
Tags were introduced. (Restricted to "trusted users," now called "advanced contributors.") For example, "unsafe" is used to mark sentences that can cause problems, are not suitable for kids, etc.
2010-06-27:
Page that lists all the sentences in a specific language, with possibility to show only those that are NOT translated yet into a certain language. For instance Japanese sentences not yet translated into English. Useful feature for contributors =)
2010-06-27:
Page that lists all the tags.
2010-06-27:
Possibility to filter by language, on the page that lists sentences with a certain tag.
2010-07-04:
The capability to import single sentences or sentence pairs from a text file was added. Contributors who want to contribute this way should ask one of the admins to do the import for them. Write to team@tatoeba.org.
2010-07-17:
Allan Simon publishes a short article about the Tatoeba Project titled Tatoeba.org, base de données de phrases d'exemple.
"We currently have over 400,000 sentences covering 53 languages and about 4,000 audio files."
First mention of LAMP (PHP with the CakePHP framework).
2010-08-03:
Trang blogs about the "submission policy." http://blog.tatoeba.org/2010/08/submission-policy-what-kind-of-content.html
2010-08-07:
Japanese furigana now displayed properly above each kanji.
2010-08-07:
You now have the possibility of displaying comments only on sentences in a certain language.
2010-08-17:
3,465 sentences were added on one day (the record, at the time).
2010-08-25:
Autocompletion of tags was introduced.
2010-08-25:
Tags now organized by popularity.
2010-09-26:
Trang's blog "Warning: you are being disrespectful" http://blog.tatoeba.org/2010/09/warning-you-are-being-disrespectful.html
2010-10-14:
The number of sentences for the top 10 languages: English = 156,000+, Japanese = 153,000+, French = 50,000+, Esperanto = 32,000+, German = 27,000+, Polish = 16,000+, Russian = 15,000+, Spanish = 14,000+, Chinese (Mandarin) = 14,000+, Ukrainian = 13,000+
2010-10-14:
The Tatoeba Project is now supporting a total of 71 languages.
2010-11-07:
Trang writes a post about tag guidelines. http://blog.tatoeba.org/2010/11/tags-guidelines.html
2010-11-13:
Tatoeba Day #1: Top 3 languages for the day were Arabic (573), Esperanto (354) and German (247).
2010-11-14:
Top 5 Languages: English = 158,000+, Japanese = 153,000+, French = 53,000+, Esperanto = 47,000+, German = 32,000+
2010-11-14:
We've reached 600,000 sentences in total today!
2010-11-21:
New "Member's Page" which displays much more quickly. No last login. "Currently contributing" is limited to members contributing the last 400 sentences.
2010-11-21:
Option to set a sentence's language to "unknown" was added.
2010-11-21:
Owner's name is now displayed on the homepage comments.
2010-11-21:
Tags info. If you hover your mouse over a tag, you will see the id of the user who added it, and the date when it was added. (To see who, http://tatoeba.org/users/show/[id])
2010-12-10:
"Moderators" (now called "corpus maintainers") can now see a list of sentences tagged more than 2 weeks ago.
2010-12-10:
The sentence stats page was created. http://tatoeba.org/eng/stats/sentences_by_language
2010-12-10:
You can now see Wall posts of just one member.
2011-01-09:
Tatoeba Day #2: Objectives: Banners and Improving the Quality of the Corpus
2011-01-25:
Trang blogs about "Legally valid content." http://blog.tatoeba.org/2011/01/legally-valid-content.html
"If there is one thing you will need to remember, it is this: do not add non CC-BY sentences in Tatoeba."
2011-02-19:
We've added a page that lists all the sentences of a user, but with the sentences options (translate, adopt, favorite, etc). This is primarily to make it a bit easier to translate sentences of a specific user.
2011-02-19:
We've added pagination for private messages.
2011-02-19:
We've stabilized the language of the interface. If your interface is in Chinese, and you click on a link where the language is set to Esperanto, you shouldn't see your interface change to Esperanto anymore.
2011-02-19:
When browsing the profile, the sentences, the comments, the favorites or the Wall messages of a user, you will see a menu that will make it easier to jump between each of these pages.
2011-02-21:
Mini Contest for Banners: http://blog.tatoeba.org/2011/02/banners-mini-contest.html
2011-02-26:
Tatoeba Day #3 stats were announced: The top 3 were Shishir (218), brauliobezerra (117) and CK (108). It was a day concerned with linking.
2011-03-01:
We've hit 4,000 members.
Of these 4,000 members, 1,795 people have contributions.
2011-03-26:
Tatoeba Day #4 (Theme = Exploration): 6 people submitted lists (funny, inspiring, ...)
2011-04-07:
Tatoeba will remember the last language you've picked when you translate or add a sentence (provided you did not set your browser to refuse cookies).
2011-04-07:
The languages of the sentences are indicated in the comments (on the homepage and the comments pages).
2011-04-07:
Translations are now ordered by language (based on the ISO code).
2011-04-07:
You can set your language preferences in your settings. This will filter the (direct and indirect) translations to be displayed only in the languages you've indicated. You will still be able to view sentences that are not in your languages, only the translations are affected. Additionally, the list of languages that you see when you translate or add a new sentence will be restricted to the languages in your settings.
2011-04-17:
Final Banners are posted in Trang's Blog http://blog.tatoeba.org/2011/04/tatoeba-banners.html
2011-04-25:
7100+ visits in one day, a new record for tatoeba.org
2011-05-01:
834,000+ sentences
2011-05-01:
For people who use our data, there is a new file that you can download: sentences_detailed.csv. This file contains additional information about the sentence: the contributor who "owns" the sentence at the time of the export, the date when the sentence was added and the date when it was last modified.
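As a rough illustration of how the extra columns might be used, here is a minimal Python sketch that tallies sentences per language and per owner. It assumes the export is tab-separated with the fields in the order id, language code, text, username, date added, date last modified. That column order, the function name `tally` and the sample rows are assumptions made for illustration and are not taken from the announcement.

```python
import csv
import io
from collections import Counter


def tally(lines):
    """Count sentences per language and per owner from tab-separated rows.

    Assumed (hypothetical) column order:
    id, lang, text, username, date_added, date_modified.
    """
    by_lang, by_owner = Counter(), Counter()
    # QUOTE_NONE: sentence text may contain quote characters that are not CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed rows rather than guess at their meaning
        _sentence_id, lang, _text, username = row[:4]
        by_lang[lang] += 1
        by_owner[username] += 1
    return by_lang, by_owner


# Made-up rows standing in for the real export file.
SAMPLE = io.StringIO(
    "1\teng\tHello.\talice\t2011-01-01 00:00:00\t2011-01-02 00:00:00\n"
    "2\tfra\tBonjour.\tbob\t2011-01-03 00:00:00\t2011-01-03 00:00:00\n"
    "3\teng\tGoodbye.\talice\t2011-01-04 00:00:00\t2011-01-04 00:00:00\n"
)
by_lang, by_owner = tally(SAMPLE)
print(by_lang.most_common())
print(by_owner.most_common())
```

To run this against a downloaded export, pass an open file object (opened with `encoding="utf-8"` and `newline=""`) to `tally` in place of `SAMPLE`.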
2011-05-01:
The activity timeline page now only displays the number of sentences added for each day in the current month. You can however browse to see the activity for other months. That was in the attempt to make this page a little bit faster to display.
2011-05-01:
Trang mentions sysko eliminating duplicates. (NOTE: Maybe I missed it, but I think this is the first mention of sysko's "duplicate-merging script" in the blog.)
2011-05-01:
Trang posts some stats: http://blog.tatoeba.org/2011/05/languages-stats-and-leaders.html
2011-05-01:
You can filter your private messages to only display those that are unread.
2011-05-11:
TatoebaPeaceKeeper account added.
2011-05-11:
Trang adds a blog with the title "Rules against bad behavior." http://blog.tatoeba.org/2011/05/rules-against-bad-behavior.html
2011-05-17:
Member status names changed: user → contributor, trusted user → advanced contributor, moderator → corpus maintainer (Other Status Names: Spammer, Inactive, Admin)
2012-01-28:
lists.csv was added to the weekly exported files.
2012-01-29:
7866 members (3052 with at least 1 sentence), 1340723 sentences
TOP 10: English = 219401, Japanese = 162619, Esperanto = 138642, French = 118900, German = 99173, Spanish = 91734, Portuguese = 64210, Turkish = 58994, Italian = 49689, Polish = 43712
2012-02-27:
1,388,838 sentences
Top 10 Languages
1		 eng	English	 222440	
2		 jpn	Japanese	 162878	
3		 epo	Esperanto	 142805	
4		 fra	French	 122863	
5		 deu	German	 103325	
6		 spa	Spanish	 97966	
7		 por	Portuguese	 65892	
8		 tur	Turkish	 63450	
9		 ita	Italian	 52366	
10		 pol	Polish	 44033	
2012-03-10 0900 France Time
1,408,440 sentences
Top 20 Languages
1		 eng	English	 223825	
2		 jpn	Japanese	 162988	
3		 epo	Esperanto	 144580	
4		 fra	French	 125119	
5		 deu	German	 104491	
6		 spa	Spanish	 100122	
7		 por	Portuguese	 66892	
8		 tur	Turkish	 66812	
9		 ita	Italian	 53847	
10		 pol	Polish	 44341	
11		 rus	Russian	 38394	
12		 cmn	Chinese	 36390	
13		 heb	Hebrew	 25837	
14		 nld	Dutch	 24609	
15		 hun	Hungarian	 21287	
16		 ukr	Ukrainian	 18098	
17		 nds	Low Saxon	 16232	
18		 pes	Persian	 11683	
19		 isl	Icelandic	 9797	
20		 ara	Arabic	 9167	
2012-03-24 0900 France Time
1,431,623 sentences
Top 20 Languages
1		 eng	English	 226182	
2		 jpn	Japanese	 163563	
3		 epo	Esperanto	 146124	
4		 fra	French	 126714	
5		 deu	German	 106284	
6		 spa	Spanish	 101706	
7		 tur	Turkish	 69990	
8		 por	Portuguese	 68340	
9		 ita	Italian	 54943	
10		 pol	Polish	 44444	
11		 rus	Russian	 39364	
12		 cmn	Chinese	 36494	
13		 heb	Hebrew	 28667	
14		 nld	Dutch	 26237	
15		 hun	Hungarian	 21747	
16		 ukr	Ukrainian	 18113	
17		 nds	Low Saxon	 16260	
18		 pes	Persian	 11807	
19		 isl	Icelandic	 9798	
20		 ara	Arabic	 9226	
2013-01-10 1505 France Time
2,034,622 sentences
Top 20 Languages
1		 eng	English	 280023	
2		 epo	Esperanto	 211605	
3		 fra	French	 175467	
4		 deu	German	 172328	
5		 jpn	Japanese	 168459	
6		 spa	Spanish	 158510	
7		 tur	Turkish	 110422	
8		 por	Portuguese	 97624	
9		 ita	Italian	 93791	
10		 heb	Hebrew	 74624	
11		 rus	Russian	 67215	
12		 pol	Polish	 48297	
13		 cmn	Chinese	 40975	
14		 ber	Berber	 40501	
15		 nld	Dutch	 29478	
16		 hun	Hungarian	 27206	
17		 ukr	Ukrainian	 22405	
18		 nds	Low Saxon	 16325	
19		 ara	Arabic	 13128	
20		 pes	Persian	 13065	
2013-04-27 0133 France Time
2,291,976 sentences
Top 20 Languages

1		 eng	English	 321626	
2		 epo	Esperanto	 246037	
3		 deu	German	 199543	
4		 fra	French	 196711	
5		 spa	Spanish	 176071	
6		 jpn	Japanese	 171192	
7		 tur	Turkish	 111997	
8		 ita	Italian	 108020	
9		 por	Portuguese	 104859	
10		 heb	Hebrew	 90102	
11		 rus	Russian	 81775	
12		 ber	Berber	 53501	
13		 pol	Polish	 52281	
14		 cmn	Chinese	 42525	
15		 hun	Hungarian	 32605	
16		 nld	Dutch	 31076	
17		 ukr	Ukrainian	 22580	
18		 nds	Low Saxon	 16326	
19		 pes	Persian	 13487	
20		 ara	Arabic	 13262	
2014-01-01 0145 France Time
2,824,873 sentences
All 132 Languages

1 eng = English = 393220	
2 epo = Esperanto = 293867	
3 deu = German = 246279	
4 fra = French = 221455	
5 spa = Spanish = 197254	
6 jpn = Japanese = 175047	
7 ita = Italian = 164308	
8 tur = Turkish = 156196	
9 rus = Russian = 149747	
10 por = Portuguese = 132713	
11 heb = Hebrew = 95612	
12 ber = Berber = 85927	
13 pol = Polish = 56314	
14 cmn = Chinese = 44262	
15 hun = Hungarian = 39715	
16 nld = Dutch = 33457	
17 ukr = Ukrainian = 23897	
18 fin = Finnish = 21267	
19 nds = Low Saxon = 16788	
20 mar = Marathi = 16771	
21 dan = Danish = 15944	
22 ara = Arabic = 14733	
23 swe = Swedish = 14457	
24 pes = Persian = 13699	
25 bul = Bulgarian = 12570	
26 lat = Latin = 11946	
27 tlh = Klingon = 10588	
28 jbo = Lojban = 10362	
29 lit = Lithuanian = 10048	
30 isl = Icelandic = 9869	
31 tgl = Tagalog = 9359	
32 ina = Interlingua = 9064	
33 nob = Norwegian (Bokmål) = 8829	
34 ell = Modern Greek = 8503	
35 uig = Uyghur = 7241	
36 srp = Serbian = 6279	
37 vie = Vietnamese = 6043	
38 ces = Czech = 5770	
39 hin = Hindi = 5093	
40 cat = Catalan = 4736	
41 ron = Romanian = 4670	
42 tat = Tatar = 4419	
43 ido = Ido = 4320	
44 bel = Belarusian = 4241	
45 wuu = Shanghainese = 4108	
46 glg = Galician = 3898	
47 yue = Cantonese = 3254	
48 hrv = Croatian = 3122	
49 oci = Occitan = 2701	
50 avk = Kotava = 2576	
51 ind = Indonesian = 2536	
52 kaz = Kazakh = 2103	
53 eus = Basque = 1717	
54 toki = Toki Pona = 1707	
55 slk = Slovak = 1656	
56 afr = Afrikaans = 1621	
57 kor = Korean = 1595	
58 lzh = Literary Chinese = 1581	
59 urd = Urdu = 1389	
60 lvs = Latvian = 1268	
61 orv = Old East Slavic = 1211	
62 est = Estonian = 1027	
63 zsm = Malay = 970	
64 ile = Interlingue = 906	
65 xal = Kalmyk = 855	
66 bre = Breton = 817	
67 mal = Malayalam = 790	
68 arq = Algerian Arabic = 672	
69 prg = Old Prussian = 568	
70 non = Norwegian (Nynorsk) = 561	
71 aze = Azerbaijani = 553	
72 vol = Volapük = 537	
73 yid = Yiddish = 517	
74 gle = Irish = 509	
75 arz = Egyptian Arabic = 487	
76 grn = Guarani = 476	
77 kat = Georgian = 457	
78 mon = Mongolian = 417	
79 gla = Scottish Gaelic = 390	
80 hsb = Upper Sorbian = 362	
81 dsb = Lower Sorbian = 355	
82 bos = Bosnian = 342	
83 cym = Welsh = 308	
84 kur = Kurdish = 305	
85 sqi = Albanian = 297	
86 ckt = Chukchi = 263	
87 slv = Slovenian = 261	
88 tha = Thai = 259	
89 uzb = Uzbek = 255	
90 que = Quechua = 230	
91 khm = Khmer = 214	
92 swh = Swahili = 181	
93 ben = Bengali = 180	
94 oss = Ossetian = 157	
95 hye = Armenian = 157	
96 ang = Old English = 133	
97 fry = Frisian = 112	
98 qya = Quenya = 112	
99 nov = Novial = 108	
100 xho = Xhosa = 102	
101 mlt = Maltese = 101	
102 pcd = Picard = 100	
103 grc = Ancient Greek = 86	
104 lld = Ladin = 77	
105 tpi = Tok Pisin = 75	
106 lad = Ladino = 58	
107 tel = Telugu = 50	
108 unknown = unknown = 49	
109 tgk = Tajik = 48	
110 cycl = CycL = 45	
111 ast = Asturian = 44	
112 bod = Standard Tibetan = 41	
113 sjn = Sindarin = 40	
114 acm = Iraqi Arabic = 37	
115 pnb = Punjabi = 34	
116 tpw = Old Tupi = 34	
117 pms = Piemontese = 31	
118 cor = Cornish = 28	
119 fao = Faroese = 27	
120 cha = Chamorro = 27	
121 san = Sanskrit = 24	
122 ewe = Ewe = 22	
123 lao = Lao = 21	
124 mlg = Malagasy = 20	
125 hil = Hiligaynon = 19	
126 scn = Sicilian = 18	
127 mri = Maori = 18	
128 ain = Ainu = 17	
129 roh = Romansh = 17	
130 npi = Nepali = 14	
131 ksh = Kölsch = 11	
132 nan = Teochew = 7	
Misc. Stats from 2014-02-22
Generated from the sentences_detailed.csv file.

2,934,118 sentences total
  * 1,953,634 (66.58%) of these are by identified native speakers.

26,617 duplicates
0.91% are duplicates

2,432 sentences with no language identified.
0.08% have no language identified.

2,533,193 sentences in languages with identified native speakers.
  * 1,953,634 (77.12%) of these are by identified native speakers. (bit.ly/nativespeakers)

Percentage by Native Speakers
77.12% of sentences in languages with identified native speakers are by identified native speakers.
66.58% of all sentences are by identified native speakers.

400,150 sentences in languages with no identified native speakers.
  * 308,518 (77.10%) of these are Esperanto.

## 11,164 usernames have contributed sentences. ##
231 usernames have contributed over 1,000 sentences.
921 usernames have contributed 100 to 999 sentences.
596 usernames have contributed 50 to 99 sentences.
2,065 usernames have contributed 10 to 49 sentences.
933 usernames have contributed 6 to 9 sentences.
6,417 usernames have contributed 1 to 5 sentences.
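
A breakdown like the one above could be produced from per-user sentence counts along these lines. This is only a sketch: the bucket boundaries are assumed to be non-overlapping, the top bucket is assumed to start at exactly 1,000, and the names `BUCKETS` and `bucket_counts` and the sample data are invented for illustration.

```python
from collections import Counter

# (label, low, high); high=None means the bucket is open-ended.
# Boundaries are an assumption chosen so that the ranges do not overlap.
BUCKETS = [
    ("1,000 or more", 1000, None),
    ("100 to 999", 100, 999),
    ("50 to 99", 50, 99),
    ("10 to 49", 10, 49),
    ("6 to 9", 6, 9),
    ("1 to 5", 1, 5),
]


def bucket_counts(per_user):
    """Map each username's sentence count to a bucket and count usernames per bucket."""
    result = Counter()
    for n in per_user.values():
        for label, low, high in BUCKETS:
            if n >= low and (high is None or n <= high):
                result[label] += 1
                break  # each username lands in exactly one bucket
    return result


# Made-up per-user totals.
sample = {"a": 1200, "b": 150, "c": 7, "d": 3, "e": 1}
buckets = bucket_counts(sample)
for label, _low, _high in BUCKETS:
    print(f"{buckets[label]} usernames have contributed {label} sentences.")
```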


## Number of English Sentences
## in CK's Subset of 272,765 "English Sentences to Use"
## (See http://bit.ly/tatoebafilters)
## Translated by Native Speakers

tur =  141755  (141755/272765 = 52%)
deu =  79656   (79656/272765  = 29%)
spa =  73448   (73448/272765  = 27%)
fra =  58145   (58145/272765  = 21%)
rus =  53013   (53013/272765  = 19%)
ita =  45178   (45178/272765  = 17%)
por =  33181   (33181/272765  = 12%)
jpn =  29271   (29271/272765  = 11%)
heb =  24907   (24907/272765  = 9%)
ber =  21403   (21403/272765  = 8%)
cmn =  13897   (13897/272765  = 5%)
pol =  12904   (12904/272765  = 5%)
nld =  11906   (11906/272765  = 4%)
ukr =  9459    (9459/272765   = 3%)
mar =  9231    (9231/272765   = 3%)
bul =  7678    (7678/272765   = 3%)
fin =  7032    (7032/272765   = 3%)
swe =  6895    (6895/272765   = 3%)
ara =  6686    (6686/272765   = 2%)
dan =  5449
isl =  5259
hun =  4817
nob =  3644
tgl =  3277
nds =  3090
hin =  2427
yue =  1549
ron =  1468
pes =  1449
bel =  1177
vie =  864
urd =  824
ces =  605
mal =  537
lvs =  474
ind =  419
cat =  371
slk =  333
hrv =  326
afr =  317
ben =  295
non =  165
uig =  125
srp =  111
ell =  106
glg =  81
xal =  68
zsm =  51
kur =  28
tat =  22
acm =  5
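
The percentages in the list above are each language's count divided by the subset size of 272,765, rounded to a whole number. A small sketch of that arithmetic, using three of the counts listed above (the function name `coverage` is made up for this example):

```python
def coverage(translated, subset_size):
    """Return each language's share of the subset as a whole-number percentage."""
    return {lang: round(100 * n / subset_size) for lang, n in translated.items()}


SUBSET = 272_765
counts = {"tur": 141_755, "deu": 79_656, "spa": 73_448}
shares = coverage(counts, SUBSET)
print(shares)  # {'tur': 52, 'deu': 29, 'spa': 27}
```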