- 2001 (and before):
- The Tanaka Corpus (which eventually was imported into the Tatoeba Project) was being compiled by Professor Yasuhito Tanaka and his students at Hyogo University
- 2003:
- Jim Breen integrates the Tanaka Corpus into his WWWJDIC server
- 2006-08-15:
- Trang's dictionary project
This is a multilingual dictionary. A Wikipedia type of thing, except that people add sentences, not articles. The aim: to build a large database of sentences translated into various languages that everyone can access for free.
http://sourceforge.net/projects/multilangdict/
- 2006-08-18:
- First News: Help Wanted http://sourceforge.net/news/?group_id=175276
- 2007-05-13:
- This "dictionary" is soon going to be hosted on my university server.
- 2008-01-20:
- www.manythings.org/corpus started using the Tanaka Corpus that was being maintained by Jim Breen and Paul Blay.
- 2008-01:
- the domain name http://tatoeba.fr was registered
- 2008-06:
- the domain name http://tatoeba.org was registered
- 2008-10-20:
- Date of the last public domain version of the Tanaka Corpus. ftp.monash.edu.au/pub/nihongo/examples_pd.gz
Tatoeba Project Sentence 237582 is perhaps the highest numbered Japanese sentence
Tatoeba Project Sentence 329712 is perhaps the highest numbered English sentence
Some of the very low numbered sentences, I think, were not part of the Tanaka Corpus.
- 2009-01-16:
- Previous log data was lost with migration. All dates set to unknown. However, for a few thousand sentences author names were retrieved and added.
- 2009-01-16:
- The forum wasn't migrated.
- 2009-01-22:
- 330,000 sentences - 150,000 in English, about the same in Japanese and 24,000 in French.
- 2009-01-22:
- The original 24,000 French sentences came from Tokidoki (http://tokidoki.fr/ - currently offline) and were given to the Tatoeba Project.
- 2009-01-31:
- The project moved to a new server and started using the domain name tatoeba.org.
- 2009-02-19:
- Kakashi Japanese-to-Romaji converter re-implemented. However, it's far from perfect.
- 2009-04-04:
- For the record, there are currently over 150,000 sentences in English and Japanese, and about 24,000 French translations for these 150,000 sentences.
- 2009-10-22: Blog Entry Date
- Originally started with sentences from the Tanaka Corpus which had 212,000 sentence pairs. These were cleaned up quite a bit by Jim Breen and Paul Blay before being imported into the Tatoeba Project.
- 2009-10-22:
- The French translations that were given to me were the result of the work of 80 volunteers.
- 2009-11-28:
- Trang says: So the concept is : we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality.
- 2009-12-13:
- Jim Breen starts using the Tatoeba Project to maintain the version of the Tanaka Corpus being used by WWWJDIC.
- 2009-12-13:
- Trang announces the change from public domain to the Creative Commons Attribution license.
- 2010-02-13:
- On February 13th we went to an event organized by an association based in Paris called Shtooka (http://shtooka.net/ - currently offline).
- 2010-02-23:
- Trang blogs for the first time about the Tatoeba Project policies for contributions.
- 2010-03-06:
- In addition to the Google auto-detect language, users can now choose the language for a submission.
- 2010-03-06:
- Now users can adopt sentences in place instead of being redirected to an info page. (NOTE: The reason she uses the word "adopt" is that what is now called the "owner" of a sentence was originally called the "parent." Hence also "orphan" sentence.)
- 2010-03-06:
- Only the sentence being translated is now shown, to prevent new contributors from adding the translation to the wrong sentence.
- 2010-03-06:
- Option to delete comments was introduced. (NOTE: You can't edit them, but you can delete and resubmit.)
- 2010-03-10:
- Chinese sentences are now displayed both in traditional AND simplified.
- 2010-03-10:
- It was announced that a pinyin converter and conversion between simplified/traditional Chinese were added.
- 2010-03-10:
- Pagination of the Wall and latest messages shown on the main page.
- 2010-03-13:
- Pages with just the comments on one user's sentences were added.
- 2010-03-13:
- Pages with just one user's comments added to profiles.
- 2010-03-13:
- Trang invites people to help translate the interface into Japanese, Spanish and German. Texts hosted on Launchpad: https://translations.launchpad.net/tatoeba
- 2010-04-01:
- Audio is introduced with 900 Shanghainese audio files recorded by fucongcong. (The sentences came from shanghaining.com.)
- 2010-04-02:
- Switched to MeCab for handling Japanese Furigana and Romaji. Romaji now only shows up on mouseover rather than being displayed on the page.
- 2010-04-02:
- The link/unlink feature was added for any member who was what is now called an "advanced contributor." (Used to be "trusted user.")
- 2010-04-16:
- Duplicate sentences are to be merged, and private messages look better.
- 2010-04-16:
- Trang announces the move to a new server, provided by the Free Software Foundation in France. (Used to be hosted by the webmaster of tokidoki.fr)
- 2010-04-18:
- Switched search engines from Lucene to Sphinx
- 2010-04-30:
- Trang blogs for the first time about how to improve the reliability of the sentences.
- 2010-04-30:
- Trang talks about the idea of members "voting" on whether a sentence is accurate or not.
- 2010-04-30:
- Trang talks about the possibility of "locking" a sentence once it's considered completely reliable, so even the owner can't change it.
- 2010-05-01:
- Downloadable files are now updated every week.
- 2010-05-08:
- You can also browse sentences that belong to a specific user, and you can filter them by language.
- 2010-05-08:
- You can now add Tatoeba as a search engine in your little Firefox search bar.
- 2010-05-16:
- Contributors can now edit and translate sentences directly from a list (as well as adopt, favorite and add to another list).
- 2010-05-16:
- Indirect translations are taken into account in the search.
- 2010-05-16:
- You can download a list into a file. (NOTE: In 2012, there is a limit on how long a list can be in order to use this function.)
- 2010-05-16:
- You can specify the target language when searching sentences. That is to say, you can not only search "from", but also "to" a specific language.
- 2010-05-22:
- Edit/Show pages for lists. Edit: for editing, translating sentences from the list. Show: simply for viewing the sentences and listening to them.
- 2010-05-22:
- Pagination in lists (so that it won't take forever to load long lists).
- 2010-05-22:
- Possibility to specify language of next and previous links in "Browse" section.
- 2010-05-24:
- Trang discusses the role of "moderators" (now called "corpus maintainers").
- 2010-05-30:
- We added support for right to left languages (like Arabic). They are not actually displayed right to left.
- 2010-05-30:
- We simplified the registration process.
- 2010-06-12:
- Tags were introduced, restricted to "trusted users" (now called "advanced contributors"). For example, "unsafe" marks sentences that can cause problems, are not suitable for kids, etc.
- 2010-06-27:
- Page that lists all the sentences in a specific language, with possibility to show only those that are NOT translated yet into a certain language. For instance Japanese sentences not yet translated into English. Useful feature for contributors =)
- 2010-06-27:
- Page that lists all the tags.
- 2010-06-27:
- Possibility to filter by language, on the page that lists sentences with a certain tag.
- 2010-07-04:
- The capability to import single sentences or sentence pairs from a text file was added. If you want to contribute this way, ask one of the admins to do it for you. Write to team@tatoeba.org.
- 2010-07-17:
- Allan Simon publishes a short article about the Tatoeba Project titled Tatoeba.org, base de données de phrases d'exemple.
"We currently have over 400,000 sentences covering 53 languages and about 4,000 audio files."
First mention of LAMP (PHP with the CakePHP framework).
- 2010-08-03:
- Trang blogs about the "submission policy." http://blog.tatoeba.org/2010/08/submission-policy-what-kind-of-content.html
- 2010-08-07:
- Japanese furigana now displayed properly above each kanji.
- 2010-08-07:
- You now have the possibility of displaying comments only on sentences in a certain language.
- 2010-08-17:
- 3,465 sentences were added in one day (the record, at the time).
- 2010-08-25:
- Autocompletion of tags was introduced.
- 2010-08-25:
- Tags now organized by popularity.
- 2010-09-26:
- Trang's blog "Warning: you are being disrespectful" http://blog.tatoeba.org/2010/09/warning-you-are-being-disrespectful.html
- 2010-10-14:
- The number of sentences for the top 10 languages: English = 156,000+, Japanese = 153,000+, French = 50,000+, Esperanto = 32,000+, German = 27,000+, Polish = 16,000+, Russian = 15,000+, Spanish = 14,000+, Chinese (Mandarin) = 14,000+, Ukrainian = 13,000+
- 2010-10-14:
- The Tatoeba Project is now supporting a total of 71 languages.
- 2010-11-07:
- Trang writes a post about tag guidelines. http://blog.tatoeba.org/2010/11/tags-guidelines.html
- 2010-11-13:
- Tatoeba Day #1 - Top 3 languages for the day were Arabic (573), Esperanto (354) and German (247).
- 2010-11-14:
- Top 5 Languages: English = 158,000+, Japanese = 153,000+, French = 53,000+, Esperanto = 47,000+, German = 32,000+
- 2010-11-14:
- We've reached 600,000 sentences in total today!
- 2010-11-21:
- New "Member's Page", which displays much more quickly. No last login. "Currently contributing" is limited to members contributing the last 400 sentences.
- 2010-11-21:
- Option to set a sentence's language to "unknown" was added.
- 2010-11-21:
- Owner's name is now displayed on the homepage comments.
- 2010-11-21:
- Tags info. If you hover your mouse over a tag, you will see the id of the user who added it, and the date when it was added. (To see who, http://tatoeba.org/users/show/[id])
- 2010-12-10:
- "Moderators" (now called "corpus maintainers") can now see a list of sentences tagged more than 2 weeks ago.
- 2010-12-10:
- The sentence stats page was created. http://tatoeba.org/eng/stats/sentences_by_language
- 2010-12-10:
- You can now see Wall posts of just one member.
- 2011-01-09:
- Tatoeba Day #2: Objectives: Banners and Improving the Quality of the Corpus
- 2011-01-25:
- Trang blogs about "Legally valid content." http://blog.tatoeba.org/2011/01/legally-valid-content.html
"If there is one thing you will need to remember, it is this: do not add non CC-BY sentences in Tatoeba."
- 2011-02-19:
- We've added a page that lists all the sentences of a user, but with the sentences options (translate, adopt, favorite, etc). This is primarily to make it a bit easier to translate sentences of a specific user.
- 2011-02-19:
- We've added pagination for private messages.
- 2011-02-19:
- We've stabilized the language of the interface. If your interface is in Chinese, and you click on a link where the language is set to Esperanto, you shouldn't see your interface change to Esperanto anymore.
- 2011-02-19:
- When browsing the profile, the sentences, the comments, the favorites or the Wall messages of a user, you will see a menu that will make it easier to jump between each of these pages.
- 2011-02-21:
- Mini Contest for Banners: http://blog.tatoeba.org/2011/02/banners-mini-contest.html
- 2011-02-26:
- Tatoeba Day #3 stats were announced: The top 3 were Shishir (218), brauliobezerra (117) and CK (108). It was a day concerned with linking.
- 2011-03-01:
- We've hit 4,000 members.
Of these 4,000 members, 1,795 people have contributions.
- 2011-03-26:
- Tatoeba Day #4 (Theme = Exploration): 6 people submitted lists (funny, inspiring, ...)
- 2011-04-07:
- Tatoeba will remember the last language you've picked when you translate or add a sentence (provided you did not set your browser to refuse cookies).
- 2011-04-07:
- The languages of the sentences are indicated in the comments (on the homepage and the comments pages).
- 2011-04-07:
- Translations are now ordered by language (based on the ISO code).
- 2011-04-07:
- You can set your language preferences in your settings. This will filter the (direct and indirect) translations to be displayed only in the languages you've indicated. You will still be able to view sentences that are not in your languages, only the translations are affected. Additionally, the list of languages that you see when you translate or add a new sentence will be restricted to the languages in your settings.
- 2011-04-17:
- Final Banners are posted in Trang's Blog http://blog.tatoeba.org/2011/04/tatoeba-banners.html
- 2011-04-25:
- 7100+ visits in one day, a new record for tatoeba.org
- 2011-05-01:
- 834,000+ sentences
- 2011-05-01:
- For people who use our data, there is a new file that you can download: sentences_detailed.csv. This file contains additional information about the sentence: the contributor who "owns" the sentence at the time of the export, the date when the sentence was added and the date when it was last modified.
- 2011-05-01:
- The activity timeline page now only displays the number of sentences added for each day in the current month. You can, however, browse to see the activity for other months. This was an attempt to make the page a little bit faster to display.
- 2011-05-01:
- Trang mentions sysko eliminating duplicates. (NOTE: Maybe I missed it, but I think this is the first mention of sysko's "duplicate-merging script" in the blog.)
- 2011-05-01:
- Trang posts some stats: http://blog.tatoeba.org/2011/05/languages-stats-and-leaders.html
- 2011-05-01:
- You can filter your private messages to only display those that are unread.
- 2011-05-11:
- TatoebaPeaceKeeper account added.
- 2011-05-11:
- Trang adds a blog with the title "Rules against bad behavior." http://blog.tatoeba.org/2011/05/rules-against-bad-behavior.html
- 2011-05-17:
- Member status names changed: user → contributor, trusted user → advanced contributor, moderator → corpus maintainer (Other Status Names: Spammer, Inactive, Admin)
- 2012-01-28:
- lists.csv was added to the weekly exported files.
- 2012-01-29:
- 7866 members (3052 with at least 1 sentence), 1340723 sentences
TOP 10: English = 219401, Japanese = 162619, Esperanto = 138642, French = 118900, German = 99173, Spanish = 91734, Portuguese = 64210, Turkish = 58994, Italian = 49689, Polish = 43712
- 2012-02-27:
- 1,388,838 sentences
Top 10 Languages
1 eng English 222440
2 jpn Japanese 162878
3 epo Esperanto 142805
4 fra French 122863
5 deu German 103325
6 spa Spanish 97966
7 por Portuguese 65892
8 tur Turkish 63450
9 ita Italian 52366
10 pol Polish 44033
- 2012-03-10 0900 France Time
- 1,408,440 sentences
Top 20 Languages
1 eng English 223825
2 jpn Japanese 162988
3 epo Esperanto 144580
4 fra French 125119
5 deu German 104491
6 spa Spanish 100122
7 por Portuguese 66892
8 tur Turkish 66812
9 ita Italian 53847
10 pol Polish 44341
11 rus Russian 38394
12 cmn Chinese 36390
13 heb Hebrew 25837
14 nld Dutch 24609
15 hun Hungarian 21287
16 ukr Ukrainian 18098
17 nds Low Saxon 16232
18 pes Persian 11683
19 isl Icelandic 9797
20 ara Arabic 9167
- 2012-03-24 0900 France Time
- 1,431,623 sentences
Top 20 Languages
1 eng English 226182
2 jpn Japanese 163563
3 epo Esperanto 146124
4 fra French 126714
5 deu German 106284
6 spa Spanish 101706
7 tur Turkish 69990
8 por Portuguese 68340
9 ita Italian 54943
10 pol Polish 44444
11 rus Russian 39364
12 cmn Chinese 36494
13 heb Hebrew 28667
14 nld Dutch 26237
15 hun Hungarian 21747
16 ukr Ukrainian 18113
17 nds Low Saxon 16260
18 pes Persian 11807
19 isl Icelandic 9798
20 ara Arabic 9226
- 2013-01-10 1505 France Time
- 2,034,622 sentences
Top 20 Languages
1 eng English 280023
2 epo Esperanto 211605
3 fra French 175467
4 deu German 172328
5 jpn Japanese 168459
6 spa Spanish 158510
7 tur Turkish 110422
8 por Portuguese 97624
9 ita Italian 93791
10 heb Hebrew 74624
11 rus Russian 67215
12 pol Polish 48297
13 cmn Chinese 40975
14 ber Berber 40501
15 nld Dutch 29478
16 hun Hungarian 27206
17 ukr Ukrainian 22405
18 nds Low Saxon 16325
19 ara Arabic 13128
20 pes Persian 13065
- 2013-04-27 0133 France Time
- 2,291,976 sentences
Top 20 Languages
1 eng English 321626
2 epo Esperanto 246037
3 deu German 199543
4 fra French 196711
5 spa Spanish 176071
6 jpn Japanese 171192
7 tur Turkish 111997
8 ita Italian 108020
9 por Portuguese 104859
10 heb Hebrew 90102
11 rus Russian 81775
12 ber Berber 53501
13 pol Polish 52281
14 cmn Chinese 42525
15 hun Hungarian 32605
16 nld Dutch 31076
17 ukr Ukrainian 22580
18 nds Low Saxon 16326
19 pes Persian 13487
20 ara Arabic 13262
- 2014-01-01 0145 France Time
- 2,824,873 sentences
All 132 Languages
1 eng = English = 393220
2 epo = Esperanto = 293867
3 deu = German = 246279
4 fra = French = 221455
5 spa = Spanish = 197254
6 jpn = Japanese = 175047
7 ita = Italian = 164308
8 tur = Turkish = 156196
9 rus = Russian = 149747
10 por = Portuguese = 132713
11 heb = Hebrew = 95612
12 ber = Berber = 85927
13 pol = Polish = 56314
14 cmn = Chinese = 44262
15 hun = Hungarian = 39715
16 nld = Dutch = 33457
17 ukr = Ukrainian = 23897
18 fin = Finnish = 21267
19 nds = Low Saxon = 16788
20 mar = Marathi = 16771
21 dan = Danish = 15944
22 ara = Arabic = 14733
23 swe = Swedish = 14457
24 pes = Persian = 13699
25 bul = Bulgarian = 12570
26 lat = Latin = 11946
27 tlh = Klingon = 10588
28 jbo = Lojban = 10362
29 lit = Lithuanian = 10048
30 isl = Icelandic = 9869
31 tgl = Tagalog = 9359
32 ina = Interlingua = 9064
33 nob = Norwegian (Bokmål) = 8829
34 ell = Modern Greek = 8503
35 uig = Uyghur = 7241
36 srp = Serbian = 6279
37 vie = Vietnamese = 6043
38 ces = Czech = 5770
39 hin = Hindi = 5093
40 cat = Catalan = 4736
41 ron = Romanian = 4670
42 tat = Tatar = 4419
43 ido = Ido = 4320
44 bel = Belarusian = 4241
45 wuu = Shanghainese = 4108
46 glg = Galician = 3898
47 yue = Cantonese = 3254
48 hrv = Croatian = 3122
49 oci = Occitan = 2701
50 avk = Kotava = 2576
51 ind = Indonesian = 2536
52 kaz = Kazakh = 2103
53 eus = Basque = 1717
54 toki = Toki Pona = 1707
55 slk = Slovak = 1656
56 afr = Afrikaans = 1621
57 kor = Korean = 1595
58 lzh = Literary Chinese = 1581
59 urd = Urdu = 1389
60 lvs = Latvian = 1268
61 orv = Old East Slavic = 1211
62 est = Estonian = 1027
63 zsm = Malay = 970
64 ile = Interlingue = 906
65 xal = Kalmyk = 855
66 bre = Breton = 817
67 mal = Malayalam = 790
68 arq = Algerian Arabic = 672
69 prg = Old Prussian = 568
70 non = Norwegian (Nynorsk) = 561
71 aze = Azerbaijani = 553
72 vol = Volapük = 537
73 yid = Yiddish = 517
74 gle = Irish = 509
75 arz = Egyptian Arabic = 487
76 grn = Guarani = 476
77 kat = Georgian = 457
78 mon = Mongolian = 417
79 gla = Scottish Gaelic = 390
80 hsb = Upper Sorbian = 362
81 dsb = Lower Sorbian = 355
82 bos = Bosnian = 342
83 cym = Welsh = 308
84 kur = Kurdish = 305
85 sqi = Albanian = 297
86 ckt = Chukchi = 263
87 slv = Slovenian = 261
88 tha = Thai = 259
89 uzb = Uzbek = 255
90 que = Quechua = 230
91 khm = Khmer = 214
92 swh = Swahili = 181
93 ben = Bengali = 180
94 oss = Ossetian = 157
95 hye = Armenian = 157
96 ang = Old English = 133
97 fry = Frisian = 112
98 qya = Quenya = 112
99 nov = Novial = 108
100 xho = Xhosa = 102
101 mlt = Maltese = 101
102 pcd = Picard = 100
103 grc = Ancient Greek = 86
104 lld = Ladin = 77
105 tpi = Tok Pisin = 75
106 lad = Ladino = 58
107 tel = Telugu = 50
108 unknown = unknown = 49
109 tgk = Tajik = 48
110 cycl = CycL = 45
111 ast = Asturian = 44
112 bod = Standard Tibetan = 41
113 sjn = Sindarin = 40
114 acm = Iraqi Arabic = 37
115 pnb = Punjabi = 34
116 tpw = Old Tupi = 34
117 pms = Piemontese = 31
118 cor = Cornish = 28
119 fao = Faroese = 27
120 cha = Chamorro = 27
121 san = Sanskrit = 24
122 ewe = Ewe = 22
123 lao = Lao = 21
124 mlg = Malagasy = 20
125 hil = Hiligaynon = 19
126 scn = Sicilian = 18
127 mri = Maori = 18
128 ain = Ainu = 17
129 roh = Romansh = 17
130 npi = Nepali = 14
131 ksh = Kölsch = 11
132 nan = Teochew = 7
- Misc. Stats from 2014-02-22
Generated from the sentences_detailed.csv file.
2,934,118 sentences total
* 1,953,634 (66.58%) of these are by identified native speakers.
26,617 duplicates
0.91% are duplicates
2,432 sentences with no language identified.
0.08% have no language identified.
2,533,193 sentences in languages with identified native speakers.
* 1,953,634 (77.12%) of these are by identified native speakers. (bit.ly/nativespeakers)
Percentage by Native Speakers
77.12% of sentences in languages with identified native speakers are by identified native speakers.
66.58% of all sentences are by native speakers.
400,150 sentences in languages with no identified native speakers.
* 308,518 (77.10%) of these are Esperanto.
## 11,164 usernames have contributed sentences. ##
231 usernames have contributed over 1,000 sentences.
921 usernames have contributed 100 to 999 sentences.
596 usernames have contributed 50 to 99 sentences.
2,065 usernames have contributed 10 to 49 sentences.
933 usernames have contributed 6 to 9 sentences.
6,417 usernames have contributed 1 to 5 sentences.
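A breakdown like the one above can be produced by counting sentences per username and then counting usernames per range. The following is only a minimal sketch, not the script that generated these numbers. It assumes that sentences_detailed.csv is tab-separated with the columns id, language, text, username, date added, date modified (the column order is an assumption), and the non-overlapping bucket edges and the function name `bucket_contributors` are hypothetical choices for illustration.

```python
import csv
from collections import Counter

# Assumed bucket edges as (inclusive lower bound, label), largest first.
BUCKETS = [
    (1000, "1,000 or more"),
    (100, "100 to 999"),
    (50, "50 to 99"),
    (10, "10 to 49"),
    (6, "6 to 9"),
    (1, "1 to 5"),
]


def bucket_contributors(lines):
    """Return (sentences per username, usernames per bucket)."""
    per_user = Counter()
    # QUOTE_NONE: sentence text may contain quote characters that are not CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed rows
        per_user[row[3]] += 1  # assumed: 4th column is the username

    per_bucket = Counter()
    for count in per_user.values():
        for lower, label in BUCKETS:
            if count >= lower:
                per_bucket[label] += 1
                break
    return per_user, per_bucket


# Tiny made-up sample standing in for the real file.
sample = [
    f"{i}\teng\tSentence {i}.\talice\t2014-02-22 00:00:00\t2014-02-22 00:00:00"
    for i in range(1, 7)
]
sample.append("7\tfra\tBonjour.\tbob\t2014-02-22 00:00:00\t2014-02-22 00:00:00")

per_user, per_bucket = bucket_contributors(sample)
for _, label in BUCKETS:
    print(f"{per_bucket[label]} usernames have contributed {label} sentences.")
```

For the real file, an open file handle (for example `open("sentences_detailed.csv", encoding="utf-8", newline="")`) can be passed in place of `sample`.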
## Number of English Sentences
## in CK's Subset of 272,765 "English Sentences to Use"
## (See http://bit.ly/tatoebafilters)
## Translated by Native Speakers
tur = 141755 (141755/272765 = 52%)
deu = 79656 (79656/272765 = 29%)
spa = 73448 (73448/272765 = 27%)
fra = 58145 (58145/272765 = 21%)
rus = 53013 (53013/272765 = 19%)
ita = 45178 (45178/272765 = 17%)
por = 33181 (33181/272765 = 12%)
jpn = 29271 (29271/272765 = 11%)
heb = 24907 (24907/272765 = 9%)
ber = 21403 (21403/272765 = 8%)
cmn = 13897 (13897/272765 = 5%)
pol = 12904 (12904/272765 = 5%)
nld = 11906 (11906/272765 = 4%)
ukr = 9459 (9459/272765 = 3%)
mar = 9231 (9231/272765 = 3%)
bul = 7678 (7678/272765 = 3%)
fin = 7032 (7032/272765 = 3%)
swe = 6895 (6895/272765 = 3%)
ara = 6686 (6686/272765 = 2%)
dan = 5449
isl = 5259
hun = 4817
nob = 3644
tgl = 3277
nds = 3090
hin = 2427
yue = 1549
ron = 1468
pes = 1449
bel = 1177
vie = 864
urd = 824
ces = 605
mal = 537
lvs = 474
ind = 419
cat = 371
slk = 333
hrv = 326
afr = 317
ben = 295
non = 165
uig = 125
srp = 111
ell = 106
glg = 81
xal = 68
zsm = 51
kur = 28
tat = 22
acm = 5
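The percentages shown for the larger languages in this list are each language's count divided by the 272,765 sentences in the subset, rounded to a whole percent. A minimal sketch of that arithmetic follows. The function name `coverage_percent` is hypothetical, and rounding to the nearest whole percent is an assumption that matches the figures shown.

```python
SUBSET_SIZE = 272765  # size of the "English Sentences to Use" subset, from the text


def coverage_percent(translated, subset_size=SUBSET_SIZE):
    """Share of the subset translated, as a whole-number percentage."""
    return round(100 * translated / subset_size)


# A few counts copied from the list above.
counts = {"tur": 141755, "deu": 79656, "ara": 6686}
for code, n in counts.items():
    print(f"{code} = {n} ({n}/{SUBSET_SIZE} = {coverage_percent(n)}%)")
```

The same function can be applied to the languages listed without a percentage.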