Trey Jones

From devsummit
Tags Languages, Machine Translation
Primary Session Next Steps for Languages and Cross Project Collaboration
Secondary Sessions

My purpose in attending the Dev Summit is to enjoy the benefit of collaborating in person with others who are passionate about technology that brings information to the world in a variety of languages.

When I imagine a world where everyone really can share in all knowledge, I don't imagine all of them doing so in their native language. The most important foundation for language technologies that will reach as many people as possible is informed realism, with insights from both linguistics and computer science.

  • The most common estimate of the number of languages is 6,000. An unfortunate number are critically endangered, with only dozens of speakers; 50-90% of them will have no speakers by the end of the century.

Providing knowledge to *everyone* in their own language is unrealistic. We should always seek to support any community working to document, revive, or strengthen a language, but expecting to create and curate extensive knowledge repositories in a language with barely half a dozen octogenarian speakers whose grandchildren have no interest in the language is more fantasy than goal.

  • Statistical machine translation has eclipsed rule-based machine translation for unpaid, casual internet use, and building it doesn't require linguists or even speakers. But it does require data, in the form of large parallel corpora, which simply aren't available for most languages.

Even providing knowledge in translation is impractical for most of the world's languages.

  • English speakers are notoriously monolingual, but in many places multilingualism is the norm, with people speaking a home language and a major world language.

A useful planning tool would be an assessment of the most commonly spoken languages among people whose preferred language does not have an extensive Wikipedia. Whether building on the model of Simple English or increasing the readability of the larger Wikipedias, we can bring more knowledge to more people through Hindi/Urdu, Indonesian, Mandarin, French, Arabic, Russian, Spanish, and Swahili (all of which boast on the order of 100 million non-native speakers or more) than by trying to create a thousand Wikipedias for less commonly spoken languages.

  • English is particularly suited to simple computational processing, a fact often lost on English speakers: it uses few characters, has few inflections, and words are conveniently separated.

Navigating copious amounts of knowledge requires search. The simplest form of search just barely works for English, but often fails in Spanish (with dozens of verb forms), Finnish (with thousands of noun forms), Chinese (without spaces), and most other languages. Fortunately, for major world languages we have software that can overcome this by regularizing words for indexing and search.
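To make the point about regularizing words concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four sample sentences, the function names, and especially the toy Spanish suffix list, which is nowhere near a real stemmer. It only shows the general idea that an exact-match index misses inflected forms, while an index built on normalized forms finds them.

```python
import re

# Hypothetical, deliberately tiny suffix list for illustration only.
# Real search systems use full stemmers or lemmatizers per language.
TOY_SPANISH_SUFFIXES = ("amos", "aron", "ando", "aba", "ar", "as", "an", "a", "o")

def tokenize(text):
    """Split on non-word characters; this works only for languages that separate words."""
    return re.findall(r"\w+", text.lower())

def identity(word):
    """No normalization: the 'simplest form of search'."""
    return word

def toy_stem(word):
    """Strip the longest matching suffix, keeping at least three characters."""
    for suffix in sorted(TOY_SPANISH_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

def search(index, word, normalize):
    """Look up a single query word after applying the same normalization."""
    return index.get(normalize(word.lower()), set())

# Made-up sample documents using different forms of the verb "hablar".
docs = {
    1: "Ella habla español",
    2: "Nosotros hablamos mucho",
    3: "Ellos hablaron ayer",
    4: "El gato duerme",
}

exact_index = build_index(docs, identity)
stem_index = build_index(docs, toy_stem)

print(search(exact_index, "hablar", identity))  # no document contains the literal form
print(search(stem_index, "hablar", toy_stem))   # documents 1, 2, and 3 share the stem
```

The same whitespace-based `tokenize` step would return an entire Chinese sentence as one token, which is why languages written without spaces need a dedicated word segmenter before any indexing of this kind can happen.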

Again, none of this is to say that we should ever stop or even slow our efforts where there is a passionate language community, or even one passionate individual, working to build knowledge repositories or language-enabling software. But we must be realistic about what it takes to reach the majority of people in a language they understand.