Wildcards Used to Help Avoid Too Many Near Duplicates
Short, easy-to-remember URL to this page
http://bit.ly/tatoebawildcards
About
- These are "wildcards" or "tokens" used to avoid submitting too many near duplicate sentences.
- "Near duplicates" are simple transformations or similar sentences that any moderately competent second-language learner can create on their own. For example: I'm happy. => I am happy. / Tom's happy. => He's happy. => She's happy.
- Of course, sometimes near duplicates are unavoidable.
The Wildcards
- Tom, Mary, John, Alice
- Use the names in this order unless the sentence would not make sense that way. If a sentence sounds more natural with a female name, use Mary first.
- These were the top 4 names being used in sentences on the Tatoeba Project when these names were chosen, so it seemed logical to choose these.
- This helps avoid near duplicates such as
- Tom likes Mary. Jane likes Fred. Mr. Jones likes Ms. Smith.
- Tom went shopping. Ted went shopping. He went shopping. She went shopping.
- Tom asked Mary to help John. John asked Tom to help Mary.
- Pronouns
- If a sentence sounds natural using the above names, I don't usually contribute sentences with pronouns.
- This helps avoid near duplicates such as
- Tom swims. He swims. She swims.
- Tom and Mary swim. They swim.
- Give Tom this. Give him this. Give her this.
- Jackson (If you need a 2nd family name in the same sentence, use Smith.)
- Default family name (surname). Mr. Jackson, Dr. Jackson, Mr. and Mrs. Jackson. Tom Jackson (when full name is needed.)
- Boston
- If you need a city name, use Boston whenever you can. Of course, sometimes a specific city needs to be mentioned, for example, "___ is the largest city in Australia."
- If another city name is needed in the same sentence, use Chicago.
- Australia
- Default country name.
- Canadian
- Default nationality.
- Monday
- Default day of the week.
- October
- Default month.
- October 20th
- Default date.
- Thirty
- Default age when the age doesn't matter. However, sometimes a younger age makes more sense (13 or 3), or an older one, as in "Tom retired at 65."
- Three, 13, 2013
- When different numbers seem more appropriate, use one with a 3, if possible.
- 2:30
- Default time. (Use 6:30 for early morning or early evening.)
- French, English
- Default languages, used in this order.
- I study French and English.
- Harvard
- Default university name.
- Cookie
- Default pet name for a dog, cat, hamster, ...
- Park Street
- Default street name. It is the first non-numbered street name on the list of high-frequency street names in the USA, and it was already being used in the Tatoeba Corpus.
- If other street names are needed, use them in this order: Park Street, Main Street, Oak Street, Pine Street, Maple Street (the frequency order found by the US Census Bureau).
- Contractions
- Use contractions whenever they sound more natural. This helps make the audio files more natural-sounding.
- For example, "I don't like to jog," instead of "I do not like to jog."
- Of course, for sentences that would primarily be used as written language, contractions may not sound natural.
- Punctuation
- I don't use exclamation marks (!) when periods (.) will do. For out-of-context sentence examples, this is more natural, I think.
- Note: Many non-native English speakers tend to overuse exclamation marks. Sometimes, I have to submit a "near duplicate" in order to have the more natural example sentence for use in my projects.
- I put punctuation inside quotation marks when that's the standard way Americans do it.
Still incomplete: CK, 2013-02-21, updated 2013-08-10, 2013-12-30, 2014-01-26, 2015-12-27
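For anyone who generates example sentences with a script, the ordered defaults listed above could be kept in a small lookup table. The following is only a rough sketch in Python: the names `ORDERED_DEFAULTS`, `SINGLE_DEFAULTS`, and `pick` are hypothetical and are not part of any Tatoeba tool. The `female_first` option assumes that "use Mary first" means moving Mary to the front and leaving the other names in their listed order.

```python
# Ordered defaults copied from the list above. The dictionary keys are
# hypothetical labels chosen for this sketch.
ORDERED_DEFAULTS = {
    "name": ["Tom", "Mary", "John", "Alice"],
    "family_name": ["Jackson", "Smith"],
    "city": ["Boston", "Chicago"],
    "language": ["French", "English"],
    "street": ["Park Street", "Main Street", "Oak Street",
               "Pine Street", "Maple Street"],
}

# Defaults for which the list gives a single value.
SINGLE_DEFAULTS = {
    "country": "Australia",
    "nationality": "Canadian",
    "day": "Monday",
    "month": "October",
    "date": "October 20th",
    "age": "thirty",
    "time": "2:30",
    "university": "Harvard",
    "pet_name": "Cookie",
}


def pick(kind, count=1, female_first=False):
    """Return the first `count` defaults of `kind`, in the listed order."""
    values = list(ORDERED_DEFAULTS[kind])
    if kind == "name" and female_first:
        # Assumption: "use Mary first" moves Mary to the front and keeps
        # the remaining names in their original order.
        values.remove("Mary")
        values.insert(0, "Mary")
    if count > len(values):
        raise ValueError(f"only {len(values)} defaults listed for {kind!r}")
    return values[:count]


subject, obj = pick("name", 2)
example = f"{subject} likes {obj}."
street_example = (
    f"{pick('name')[0]} lives on {pick('street')[0]} in {pick('city')[0]}."
)
```

Here `example` comes out as "Tom likes Mary." and `street_example` as "Tom lives on Park Street in Boston." Asking for more values than the list provides raises an error instead of inventing a new default.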
Demos
- Wildcard Demo #1
2013 - This shows various simple English-only substitutions.
- Wildcard Demo #2
2017 - This one shows bilingual English-Japanese substitutions. There are many subpages, showing various things.
- Wildcard Demo #3
2017 - This is similar to Demo #1, but just quickly shows the name "Tom" being changed to the name "Fadil".
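The kind of name substitution described for Demo #3 can be sketched in a few lines of Python. This is not the code behind the demos, which isn't shown here. It is only a minimal illustration, and the function name `swap_name` is hypothetical. It matches whole words only, so that words such as "Tomorrow" are left alone.

```python
import re


def swap_name(sentence, old="Tom", new="Fadil"):
    """Replace whole-word occurrences of `old` with `new`."""
    # \b marks a word boundary, so "Tom" and "Tom's" match
    # but "Tomorrow" does not.
    pattern = r"\b" + re.escape(old) + r"\b"
    return re.sub(pattern, new, sentence)


sentences = ["Tom likes Mary.", "Tom's happy.", "Tomorrow is Monday."]
swapped = [swap_name(s) for s in sentences]
```

With these inputs, `swapped` is "Fadil likes Mary.", "Fadil's happy.", and "Tomorrow is Monday." Because the wildcard names are used consistently, a single substitution like this can adapt many sentences at once.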