Translation

From devsummit

3 statements.

Author: C. Scott Ananian
Tags: Censorship, Infrastructure, Languages, Machine Learning, Machine Translation, Translation, User Experience
Primary Session: Next Steps for Languages and Cross Project Collaboration
Secondary Sessions: Advancing the Contributor Experience

'One World, One Wiki!' Instead of today's many siloed wikis, separated by language and project, our goal should be to re-establish a unified community of collaborators. We will still respect language and cultural differences - there will still be English, German, Hebrew, Arabic, etc. Wikipedias; they will disagree at times - but instead of separate domains, we'll embrace a single user experience with integrated navigation between projects and languages and the possibility of split screen views aligning related content. On a single page we can work on articles in different languages, or simultaneously edit textbook content and encyclopedia articles. Via machine translation we can facilitate conversations and collaborations spanning languages and projects, without forcing a single culture or perspective.

Machine translation plays a key role in removing these barriers and enabling new content and collaborators. We should invest in our own engineers and infrastructure supporting machine translation, especially between minority languages and script variants. Our editing community will continually improve our training data and translation engines, both by explicitly authoring parallel texts (as with the Content Translation Tool) and by micro-contributions such as clicking yes/no on a proposed translation or pair of parallel texts ('bandit learning'). Using 'zero-shot translation' models, our training data from 'big' wikis can improve the translation of 'small' wikis. Every contribution further improves the ability of our tools to make additional articles from other languages available.
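
For concreteness, here is a minimal sketch of how such yes/no micro-contributions might be captured as training signal. The data model, field names, and storage are illustrative assumptions, not an existing Wikimedia API:

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class ParallelJudgement:
    """One micro-contribution: a reader's yes/no verdict on a proposed sentence pair."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str
    accepted: bool        # True = "yes, this is a good translation"
    user_id: int
    timestamp: float

def record_judgement(judgement: ParallelJudgement, path: str = "judgements.jsonl") -> None:
    """Append the judgement to a JSON-lines file that a training job can later consume."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(judgement), ensure_ascii=False) + "\n")

# Example: a reader confirms a proposed English/German pair with one click.
record_judgement(ParallelJudgement(
    source_lang="en", target_lang="de",
    source_text="The cat sat on the mat.",
    target_text="Die Katze saß auf der Matte.",
    accepted=True, user_id=12345, timestamp=time.time(),
))
```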

A translation suggestion tool will suggest an edit in one language whenever an edit is made to a parallel text in another language. The correspondences can be manually created (for example, via the Content Translation Tool), but our translation engine can also automatically search for and score potential new correspondences, or prune old entries when the translation has drifted. Again, each new correspondence trains the engine and improves its ability to suggest further correspondences and edits.
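
As a rough sketch of the bookkeeping this implies (assuming a hypothetical correspondence store and a similarity score supplied by the translation engine; all names and thresholds are illustrative):

```python
from typing import Callable

PRUNE_THRESHOLD = 0.3  # below this, the two texts are assumed to have drifted apart

def on_edit(correspondences: list[dict], edited_id: int, new_text: str,
            score_pair: Callable[[str, str], float]) -> list[dict]:
    """When one side of a correspondence is edited, re-score the pair: prune
    the link if the texts have drifted, otherwise suggest a matching edit on
    the other-language page."""
    suggestions = []
    for corr in correspondences:
        if corr["id"] != edited_id or not corr["active"]:
            continue
        corr["source_text"] = new_text
        score = score_pair(corr["source_text"], corr["target_text"])
        if score < PRUNE_THRESHOLD:
            corr["active"] = False  # the translation has drifted; drop the entry
        else:
            suggestions.append({
                "target_page": corr["target_page"],
                "reason": "parallel text was edited",
                "updated_source": new_text,
            })
    return suggestions
```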

Red-links and stubs are replaced with article text from one of the user's preferred fallback languages, perhaps split-screened with a machine translation into the user's primary language. This will keep 'small' language wikis sticky, and prevent readers from getting into the habit of searching in a 'big' language first.
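
A minimal sketch of picking which fallback-language text to show, assuming the article is already known on some wiki and using Wikidata sitelinks via the standard wbgetentities API; resolving a red-linked local title to the right concept is left aside here:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def fallback_article(title: str, known_lang: str, fallback_langs: list[str]):
    """Given an article that exists on the wiki for known_lang, return
    (lang, title) for the first of the reader's fallback languages that also
    has the article, so its text (plus a machine translation) can replace a
    red-link or stub. Returns None if no fallback wiki has it."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "sites": f"{known_lang}wiki",
        "titles": title,
        "props": "sitelinks",
        "format": "json",
    }, timeout=10).json()
    entity = next(iter(resp["entities"].values()))
    sitelinks = entity.get("sitelinks", {})
    for lang in fallback_langs:
        link = sitelinks.get(f"{lang}wiki")
        if link:
            return lang, link["title"]
    return None

# Example: a reader whose preferred fallbacks are English, then French.
print(fallback_article("Zero-shot learning", "en", ["de", "fr"]))
```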

We should build clusters specifically for training translation (and other) deep learning models. As a supplement to our relationships with the existing open-source engines Moses (statistical) and Apertium (rule-based), we should partner with the OpenNMT project for modern neural machine translation research. We should investigate whether machine translation can replace LanguageConverter, our script conversion tool; conversely, our editing fluency in ANY language pair should approach what LanguageConverter provides for its supported languages.
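
For context, LanguageConverter-style script conversion is essentially deterministic, rule-based rewriting between writing systems of the same language. A minimal illustration for Serbian Cyrillic to Latin (the real converter handles exceptions, markup, and bidirectional variants):

```python
# Illustrative only: the flavour of rule-based conversion LanguageConverter does.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def sr_cyrillic_to_latin(text: str) -> str:
    out = []
    for ch in text:
        repl = CYR_TO_LAT.get(ch.lower(), ch)
        out.append(repl.capitalize() if ch.isupper() else repl)
    return "".join(out)

print(sr_cyrillic_to_latin("Википедија"))  # -> "Vikipedija"
```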

By embracing unity between projects and erasing barriers between languages, we encourage the flow of diverse content from minority languages around the world into all of our wikis, as well as improving the availability of all of our content into indigenous languages. Language tools route around cultural or governmental censorship: by putting parallel texts and translations in the forefront of our UX we expose our differences and challenge preconceptions, learning from each other.

Author: Lucie-Aimée Kaffee
Tags: Languages, Machine Learning, Translation, Wikidata
Primary Session: Next Steps for Languages and Cross Project Collaboration
Secondary Sessions: Research, Analytics, and Machine Learning

Languages in the world of Wikimedia

One of the central topics in the Wikimedia world is language. We currently cover around 290 languages across most projects, with widely varying depth of coverage. In theory, all information in Wikipedia can be replicated and connected, so that knowledge from different cultures is interlinked and accessible no matter which language you speak. In reality, however, this is tricky. The authors of [1] show that large parts of English Wikipedia's content are not represented in other languages, not even in other big Wikipedias, and the reverse also holds: content in underserved languages is often not covered in English Wikipedia.

One possible solution is translation by the community, as done with the Content Translation tool [2]. However, that means translating articles from every language into every other language, a never-ending effort that is barely feasible for small language communities. And it is not only about Wikipedia: the other Wikimedia projects will need a similar effort.

Another approach to better language coverage in Wikipedia is the ArticlePlaceholder [3]. Using Wikidata's inherently multilingual and cross-lingual structure, the ArticlePlaceholder displays data on Wikipedias in a readable format, in their language. However, even Wikidata lacks support for many languages, as we were able to show in [4].

The question is therefore how we can get more multilingual data into Wikidata using the tools and resources we already have, and eventually how to reuse Wikidata's data on Wikipedia and other Wikimedia projects in order to support under-resourced language communities and make it easier for them to access information in their language. Content that is accessible in a language also encourages its speakers to contribute to the knowledge. We are currently investigating machine learning tools to support the display of data and the gathering of new multilingual labels for information in Wikidata. Language accessibility will likely be one of the key topics for Wikimedia and its projects over the coming years, so it is important to invest in it now and enable an exchange about it.
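
For concreteness, here is a minimal sketch of the kind of per-language lookup an ArticlePlaceholder-style page builds on, using Wikidata's public wbgetentities API; the choice of item, language, and properties is purely illustrative:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def placeholder_data(qid: str, lang: str) -> dict:
    """Fetch the label and description of a Wikidata item in one language,
    roughly the raw material an ArticlePlaceholder-style page would render."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions",
        "languages": lang,
        "format": "json",
    }, timeout=10).json()
    entity = resp["entities"][qid]
    return {
        "label": entity.get("labels", {}).get(lang, {}).get("value"),
        "description": entity.get("descriptions", {}).get(lang, {}).get("value"),
    }

# Example: Douglas Adams (Q42) in Welsh.
print(placeholder_data("Q42", "cy"))
```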


[1] Hecht, B., & Gergle, D. (2010, April). The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 291-300). ACM.
[2] https://en.wikipedia.org/wiki/Wikipedia:Content_translation_tool
[3] https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from_Wikidata_for_Wikipedia_-_Increasing_Access_to_Free_and_Open_Knowledge.pdf
[4] https://eprints.soton.ac.uk/413433/

Author: Niklas Laxström
Tags: Machine Translation, Translation
Primary Session: Next Steps for Languages and Cross Project Collaboration

Translation as a way to grow and connect our communities

The Wikimedia movement depends a lot on translation, but I believe we are not currently using its full potential. This affects us in many ways, most importantly:

  * language barriers isolate communities, yet we all need to work together,
  * our content is not accessible to every human,
  * our movement is massively multilingual, but is not a forerunner in using translation and other language technology.

We should improve our translation tools and leverage machine translation in a sustainable way. Translation should be a core part of our infrastructure and integrate into our projects seamlessly. It will help our communities to grow, as demonstrated by the Content Translation tool. I suggest three focus areas.

  1. Find partners to build high-quality open-source machine translation

Our projects run on free software. Currently, we depend heavily on proprietary data-driven (statistical) machine translation. If translation is to be an essential part of our infrastructure, this is neither sustainable nor acceptable. We already use expert-driven (rule-based) open-source machine translation software, e.g. Apertium, which provides some high-quality language pairs. However, the proprietary services cover many more language pairs, albeit with lower quality. Building machine translation engines is hard work, so we should find partners to pursue both data-driven and expert-driven engines. The impact of this could be big and extend beyond our movement.

  2. Bring translation everywhere

We already have good translation tools, but we need to move beyond the user interface and Wikipedia pages. We should integrate translation tools into our discussion systems, both to support multilingual discussions and to help people understand discussions in foreign languages. This should be combined with summarization tools.

We have a lot of (structured) content that can be translated but lacks proper translation tooling, e.g. Wikidata, Commons image descriptions, and labels in SVG files. We should adapt and integrate our existing translation tools to support these types of content.
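
As a concrete illustration of the SVG case, here is a small sketch that pulls the text labels out of a diagram so they could be offered to translators; the file name and the workflow around it are assumptions:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def extract_svg_labels(path: str) -> list[str]:
    """Return the text labels embedded in an SVG diagram."""
    tree = ET.parse(path)
    labels = []
    for elem in tree.getroot().iter():
        if elem.tag in (SVG_NS + "text", SVG_NS + "tspan") and elem.text and elem.text.strip():
            labels.append(elem.text.strip())
    return labels

print(extract_svg_labels("diagram.svg"))
```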

We should also make language selection available to all users, including those who are not logged in, on our multilingual projects such as Wikidata, so that translations are shown in their language.

  3. Improve our translation tools

Our translation tools have serious issues that result in slower translations, or in content not being translated at all.

Our translation memory is not working well. It often fails to suggest good matches; this is apparent when translating the Weekly Tech News. Translators' time is wasted when they need to re-translate (introducing inconsistencies) or search for previous translations manually. Without improvement, our translation memory is not suitable for use in Content Translation either.
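
For illustration, a minimal sketch of the fuzzy matching a translation memory performs, using only the standard library; a real service would use a proper index, segmentation, and a better similarity measure:

```python
from difflib import SequenceMatcher

def best_tm_match(source: str, memory: dict[str, str], min_score: float = 0.75):
    """Return (previous_source, previous_translation, score) for the closest
    previously translated segment, or None if nothing scores above min_score."""
    best = None
    for prev_source, prev_translation in memory.items():
        score = SequenceMatcher(None, source.lower(), prev_source.lower()).ratio()
        if score >= min_score and (best is None or score > best[2]):
            best = (prev_source, prev_translation, score)
    return best

memory = {
    "You can now translate pages more easily.": "Du kannst Seiten jetzt einfacher übersetzen.",
}
print(best_tm_match("You can now translate these pages more easily.", memory))
```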

When documentation pages, announcements, etc. are translated using the Translate extension, a significant amount of extra markup is added to the wikitext. Editors find this markup inconvenient and justifiably resist using the tool. This feature should be improved so that it works with the Visual Editor and does not require additional markup in the wikitext.