Santhosh Thottingal

From devsummit
Tags Languages, Machine Translation, Open Source, Technical Debt, Volunteer Developers
Primary Session Next Steps for Languages and Cross Project Collaboration
Secondary Sessions

MediaWiki is one of the rare software systems where i18n is done right. This infrastructure needs timely improvement and maintenance. The technology, and the resources to support it, are important, as the 2017 movement strategy states: 'We will build the technical infrastructures that enable us to collect free knowledge in all forms and languages.' But most of this infrastructure runs on volunteer capacity, with no official team responsible for it.

1. Open source strategy - MediaWiki language technology is isolated and little known in the wider open source ecosystem. It needs proper ownership, maintenance and feature enhancements, like any good open source project, so that contributions come to us from other multilingual projects while we help them with our expertise.

    (a) Our localization file formats and the libraries built on top of them are very advanced and support more languages than any other system. But other projects started noticing them only when the libraries were made independent of MediaWiki. We use that independent library (jquery.i18n) in VE, ULS and OOJS-UI, and it is present in MediaWiki core. But it is not actively maintained: issues and pull requests go unaddressed because nobody at the Foundation is now in charge of it, apart from volunteer time. There is a lot of demand for a non-jQuery, general-purpose JS version of the library. The code has aged and tech debt is increasing.
    (b) We developed one of the largest repositories of input methods (100+ languages), together with an input method library, jquery.ime, to support typing in various languages. This is a critical piece of software for many small wikis. The code has aged and is not actively maintained by anybody at the Foundation, except by some in their volunteer time. It has not been updated to take advantage of newer browser support for IMEs. This is a MediaWiki-independent open source library.
    (c) Universal Language Selector - a mechanism for selecting and switching between our large list of languages, which also delivers input methods and fonts - needs ownership and tech debt removal. Navigation between different wikis is done using it, and the team that authored it no longer exists. This is also a MediaWiki-independent open source library. VE, Translate, ContentTranslation and Wikidata depend on it.
    (d) MediaWiki core i18n features (PHP) are also starting to show their age. There were plans to make some of them standalone open source libraries, but that has not happened. Nobody is officially responsible for this infrastructure either.
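
To illustrate the kind of work these message libraries do, here is a minimal sketch of the message syntax shared by jquery.i18n and MediaWiki: `$1`-style parameters and `{{PLURAL:$1|...}}` forms. The function name `format_message` and the tiny `PLURAL_INDEX` table are assumptions made for illustration only. Real libraries rely on CLDR plural rules and also handle gender and grammar forms, none of which is shown here.

```python
import re

# Hypothetical, heavily simplified plural rules for whole numbers.
# Real i18n libraries take these rules from CLDR data.
PLURAL_INDEX = {
    "en": lambda n: 0 if n == 1 else 1,
    "fr": lambda n: 0 if n in (0, 1) else 1,
}

PARAM_RE = re.compile(r"\$(\d+)")
PLURAL_RE = re.compile(r"\{\{PLURAL:(\d+)\|([^}]*)\}\}")


def format_message(template, params, lang="en"):
    """Expand $N parameters, then {{PLURAL:count|form|form}} blocks."""

    def sub_param(match):
        index = int(match.group(1)) - 1
        # Leave the placeholder untouched if no parameter was supplied.
        if 0 <= index < len(params):
            return str(params[index])
        return match.group(0)

    text = PARAM_RE.sub(sub_param, template)

    def sub_plural(match):
        count = int(match.group(1))
        forms = match.group(2).split("|")
        rule = PLURAL_INDEX.get(lang, PLURAL_INDEX["en"])
        # Fall back to the last form if the message has too few forms.
        return forms[min(rule(count), len(forms) - 1)]

    return PLURAL_RE.sub(sub_plural, text)


print(format_message("$1 {{PLURAL:$1|page|pages}}", [3]))
```

Even this toy version shows why language coverage is hard to maintain without an owner: each language needs its own plural rule, and the table above covers only two.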

2. The Translate extension, which helps make the MediaWiki interface available in 300 languages - something we are always proud of - is not officially maintained by the Foundation now. Localization happens because of volunteers, including the volunteers maintaining the Translate extension code. Moreover, translatewiki.net, which hosts the Translate extension and is where volunteers do the localization, is also outside Foundation infrastructure.

3. It is time to have machine translation infrastructure within Wikimedia. Content Translation uses machine translation, but that is an isolated product. Translation of content can be used in various contexts for readers. CX tries to provide a service API for MT, and there is a lot of potential in that. Multiple MT services, even proprietary ones, might be needed to cover all the languages. At the same time, our content and translations are important for training new open source MT engines.
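
The idea of covering all languages through several MT services can be sketched as a small router that maps each language pair to an ordered list of backends and falls back to the next one on failure. This is only an illustration under stated assumptions: the class `MTRouter`, its methods and the backend names are hypothetical and do not describe the actual CX service API.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


class MTRouter:
    """Hypothetical registry of MT backends keyed by language pair."""

    def __init__(self) -> None:
        self._backends: Dict[Tuple[str, str], List[Tuple[str, Translator]]] = {}

    def register(self, source: str, target: str, name: str, translate: Translator) -> None:
        # Backends are tried in the order they were registered.
        self._backends.setdefault((source, target), []).append((name, translate))

    def providers(self, source: str, target: str) -> List[str]:
        return [name for name, _ in self._backends.get((source, target), [])]

    def translate(self, source: str, target: str, text: str) -> Tuple[str, str]:
        """Return (backend name, translation) from the first backend that works."""
        errors = []
        for name, backend in self._backends.get((source, target), []):
            try:
                return name, backend(text)
            except Exception as exc:  # a failing backend should not block the others
                errors.append(f"{name}: {exc}")
        raise LookupError(f"no working MT backend for {source}->{target}: {errors}")


if __name__ == "__main__":
    router = MTRouter()
    # Stand-in backend; a real one would call an MT engine.
    router.register("en", "ml", "demo-engine", lambda text: f"[ml] {text}")
    print(router.translate("en", "ml", "free knowledge"))
```

Keeping the backend name in the result lets callers attribute each translation to its engine, which matters when open source and proprietary services are mixed.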

4. Wikipedia follows a very traditional approach to typography and layout. The Language team provided limited webfont delivery to address the missing-font issue, but the code is old and has not had any updates in the last 3 years or so. The Language team planned to abandon that feature because of the maintenance burden, but that has not happened, and no team owns it now. Other than this, a few wikis use common.css hacks to customize default fonts. The typography refresh attempted by the Reading team was for Latin script. Every script has its own characteristics: font size, preferred font family sequences, line heights and so on. Presenting knowledge in all these language wikis, in 2017 or later, needs serious thought about the readability, typography and general aesthetics of Wikipedia in a language compared to other websites in that language.