Next Steps for Languages and Cross Project Collaboration

From devsummit
Related Phabricator Task: T183314
Topic Leaders: Lydia Pintscher, Santhosh Thottingal

9 primary statements. 1 secondary statement.


  • Censorship
  • Collaboration
  • Communities
  • Cross-wiki
  • Gadgets
  • Infrastructure
  • Languages
  • Lua
  • Machine Learning
  • Machine Translation
  • New Users
  • Open Source
  • Social Impact
  • Strategy
  • Structured Data
  • Technical Debt
  • Templates
  • Tools
  • Translation
  • User Experience
  • Volunteer Developers
  • Wikidata
  • Wiktionary

Primary

C. Scott Ananian
Tags: Censorship, Infrastructure, Languages, Machine Learning, Machine Translation, Translation, User Experience
Primary Session: Next Steps for Languages and Cross Project Collaboration
Secondary Sessions: Advancing the Contributor Experience

'One World, One Wiki!' Instead of today's many siloed wikis, separated by language and project, our goal should be to re-establish a unified community of collaborators. We will still respect language and cultural differences - there will still be English, German, Hebrew, Arabic, etc. Wikipedias; they will disagree at times - but instead of separate domains, we'll embrace a single user experience with integrated navigation between projects and languages and the possibility of split screen views aligning related content. On a single page we can work on articles in different languages, or simultaneously edit textbook content and encyclopedia articles. Via machine translation we can facilitate conversations and collaborations spanning languages and projects, without forcing a single culture or perspective.

Machine translation plays a key role in removing these barriers and enabling new content and collaborators. We should invest in our own engineers and infrastructure supporting machine translation, especially between minority languages and script variants. Our editing community will continually improve our training data and translation engines, both by explicitly authoring parallel texts (as with the Content Translation Tool) and by micro-contributions such as clicking yes/no on a proposed translation or pair of parallel texts ('bandit learning'). Using 'zero-shot translation' models, our training data from 'big' wikis can improve the translation of 'small' wikis. Every contribution further improves the ability of our tools to make additional articles from other languages available.
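
To make the micro-contribution loop concrete, here is a minimal sketch (in TypeScript, with hypothetical names such as SentencePair and recordJudgement; this is not an existing Wikimedia API) of how yes/no clicks could feed a quality estimate for candidate parallel sentences and select pairs for retraining:

```typescript
// Hypothetical sketch of learning from yes/no micro-contributions.
// Each candidate pair of parallel sentences keeps a running quality estimate;
// pairs above a threshold are promoted into machine-translation training data.

interface SentencePair {
  sourceLang: string;   // e.g. 'en'
  targetLang: string;   // e.g. 'ml'
  sourceText: string;
  targetText: string;
  yesVotes: number;
  noVotes: number;
}

// Smoothed point estimate of the probability that the pair is a good translation.
function qualityEstimate(pair: SentencePair): number {
  return (pair.yesVotes + 1) / (pair.yesVotes + pair.noVotes + 2);
}

// Record one contributor's yes/no click on a proposed pair.
function recordJudgement(pair: SentencePair, accepted: boolean): void {
  if (accepted) {
    pair.yesVotes += 1;
  } else {
    pair.noVotes += 1;
  }
}

// Promote well-rated pairs into the parallel corpus used for (re)training.
function selectTrainingPairs(pairs: SentencePair[], threshold = 0.8): SentencePair[] {
  return pairs.filter(p => qualityEstimate(p) >= threshold);
}
```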

A translation suggestion tool will suggest an edit in one language whenever an edit is made to a parallel text in another language. The correspondences can be manually created (for example, via the Content Translation Tool), but our translation engine can also automatically search for and score potential new correspondences, or prune old entries when the translation has drifted. Again, each new correspondence trains the engine and improves its ability to suggest further correspondences and edits.
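
As a rough illustration of how such a suggestion tool might be wired together, the sketch below uses hypothetical types (Correspondence, EditEvent) and a simple score threshold; it does not reflect any existing MediaWiki or Content Translation API:

```typescript
// Hypothetical correspondence store: a section of an article in one language
// is known (manually or automatically) to parallel a section in another.
interface Correspondence {
  page: string;           // e.g. 'Coffee'
  lang: string;           // e.g. 'en'
  section: string;        // e.g. 'History'
  parallelLang: string;   // e.g. 'es'
  parallelSection: string;
  score: number;          // alignment confidence assigned by the translation engine
}

interface EditEvent {
  page: string;
  lang: string;
  section: string;
  diffUrl: string;
}

// When an edit arrives, suggest reviewing the aligned sections in other languages.
// Low-confidence (drifted) correspondences are pruned from the suggestions.
function suggestEditsFor(edit: EditEvent, store: Correspondence[]): string[] {
  return store
    .filter(c => c.page === edit.page && c.lang === edit.lang && c.section === edit.section)
    .filter(c => c.score > 0.5)
    .map(c => `Review ${c.parallelLang}:${c.page}#${c.parallelSection} against ${edit.diffUrl}`);
}
```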

Red-links and stubs are replaced with article text from one of the user's preferred fallback languages, perhaps split-screened with a machine translation into the user's primary language. This will keep 'small' language wikis sticky, and prevent readers from getting into the habit of searching in a 'big' language first.

We should build clusters specifically for training translation (and other) deep learning models. As a supplement to our relationships with statistical translation tools Moses and Apertium, we should partner with the OpenNMT project for modern neural machine translation research. We should investigate whether machine translation can replace LanguageConverter, our script conversion tool; conversely, our editing fluency in ANY language pair should approach what LanguageConverter provides for its supported languages.

By embracing unity between projects and erasing barriers between languages, we encourage the flow of diverse content from minority languages around the world into all of our wikis, as well as improving the availability of all of our content into indigenous languages. Language tools route around cultural or governmental censorship: by putting parallel texts and translations in the forefront of our UX we expose our differences and challenge preconceptions, learning from each other.

Amanda Bittaker
Tags: Social Impact, Strategy
Primary Session: Next Steps for Languages and Cross Project Collaboration

Title: Frameworks to connect infrastructure to the mission

Thesis: We will better achieve social impact, succeed in our strategy, and fulfill our mission when the Foundation uses nonprofit programmatic frameworks for prioritizing and planning improvements to MediaWiki and other technologies.

Proposal: Impact is an intangible, abstract social benefit, and it can be difficult to consider how changes we make in MediaWiki will help or harm it. To illustrate the connections between infrastructure choices and impact, and to incorporate those connections into our plans, we can use programmatic frameworks developed in nonprofit professional communities. Frameworks used by these communities for various types of programs and impact can explicitly and concretely link our engineering choices to the movement strategy and the social benefit we create. This increased attention to the social impact of our technical decisions and investments will in turn create increased investment from our communities, partners, and potential allies beyond our community towards fulfilling our mission.

WMF programs such as New Readers and Structured Data on Commons, and Wikimedia community programs such as Wiki Loves Monuments, model how building technology for well-defined social impact can structure our engineering and infrastructure choices towards more strategic and mission-driven impact.

These programmatic frameworks can be helpful during annual planning, quarterly check-ins, and throughout the process of deciding on, planning, implementing, and evaluating technological changes. We would be able to weigh and design intentionally for broad end users while also supporting the targeted and specific organizing communities who use our technology towards our desired social impact. We could expand the impact that we achieve by consulting expert communities, such as educators, librarians, and activists, who will design additional social-impact programs and processes on top of those tools. We could also identify parts of our communities which already create desired impacts, and build technologies and technological services which increase the scale, effectiveness, and efficiency of organizing contributors to fulfill our mission. Socio-technological decisions in our movement can be most successfully achieved when considering both social and technological benefits.

Trey Jones
Tags: Languages, Machine Translation
Primary Session: Next Steps for Languages and Cross Project Collaboration

My purpose in attending the Dev Summit is to enjoy the benefit of collaborating in person with others who are passionate about technology that brings information to the world in a variety of languages.

When I imagine a world where everyone really can share in all knowledge, I don't imagine all of them doing so in their native language. The most important foundation for language technologies that will reach as many people as possible is informed realism, with insights from both linguistics and computer science.

  • The most common estimate of the number of languages is 6,000. An unfortunate number are critically endangered, with only dozens of speakers; 50-90% of them will have no speakers by the end of the century.

Providing knowledge to *everyone* in their own language is unrealistic. We should always seek to support any community working to document, revive, or strengthen a language, but expecting to create and curate extensive knowledge repositories in a language with barely half a dozen octogenarian speakers whose grandchildren have no interest in the language is more fantasy than goal.

  • Statistical machine translation has eclipsed rule-based machine translation for unpaid, casual internet use and building it doesn't require linguists or even speakers. But it does require data, in the form of large parallel corpora, which simply aren't available for most languages.

Even providing knowledge in translation is impractical for most of the world's languages.

  • English speakers are notoriously monolingual, but in many places multilingualism is the norm, with people speaking a home language and a major world language.

A useful planning tool would be an assessment of the most commonly spoken languages among people whose preferred language does not have an extensive Wikipedia. Whether building on the model of Simple English or increasing the readability of the larger Wikipedias, we can bring more knowledge to more people through Hindi/Urdu, Indonesian, Mandarin, French, Arabic, Russian, Spanish, and Swahili (all of which boast on the order of 100 million or more non-native speakers) than by trying to create a thousand Wikipedias for less commonly spoken languages.

  • English is particularly suited to simple computational processing, a fact often lost on English speakers: it uses few characters, has few inflections, and words are conveniently separated.

Navigating copious amounts of knowledge requires search. The simplest form of search just barely works for English, but often fails in Spanish (with dozens of verb forms), Finnish (with thousands of noun forms), Chinese (without spaces), and most other languages. Fortunately, for major world languages we have software that can overcome this by regularizing words for indexing and search.
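
A toy sketch of that 'simplest form of search', naive whitespace tokenization, shows where it breaks down; the example strings are illustrative only:

```typescript
// Naive indexing: lowercase and split on whitespace.
function naiveTokens(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(t => t.length > 0);
}

// Works tolerably for English, although 'cats' still won't match 'cat' without stemming.
console.log(naiveTokens('The cat sat on the mat'));
// -> [ 'the', 'cat', 'sat', 'on', 'the', 'mat' ]

// Fails for Spanish verb forms: a query for 'hablar' never matches 'hablaron'.
console.log(naiveTokens('Ellos hablaron ayer'));
// -> [ 'ellos', 'hablaron', 'ayer' ]

// Fails completely for Chinese, where there are no spaces to split on.
console.log(naiveTokens('猫坐在垫子上'));
// -> [ '猫坐在垫子上' ]  (one giant "word")

// Real search stacks add per-language analysis (stemming, lemmatization, word
// segmentation) before indexing, which is the software investment described above.
```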

Again, none of this is to say that we should ever stop or even slow our efforts where there is a passionate language community (or even one passionate individual) working to build knowledge repositories or language-enabling software. But we must be realistic about what it takes to reach the majority of people in a language they understand.

Lucie-Aimée Kaffee
Tags: Languages, Machine Learning, Translation, Wikidata
Primary Session: Next Steps for Languages and Cross Project Collaboration
Secondary Sessions: Research, Analytics, and Machine Learning

Languages in the world of Wikimedia

One of the central topics of Wikimedia's world is languages. Currently, we cover around 290 languages across most projects, more or less well. In theory, all information in Wikipedia can be replicated and connected, so that different cultures' knowledge is interlinked and accessible no matter which language you speak. In reality, however, this can be tricky. The authors of [1] show that even English Wikipedia's content is in large part not represented in other languages, even in other big Wikipedias, and the other way around: the content in underserved languages is often not covered in English Wikipedia.

A possible solution is translation by the community, as done with the Content Translation tool [2]. Nevertheless, that means translating all language articles into all other languages, an effort that is never-ending and, especially for small language communities, barely feasible. And it is not only about Wikipedia: the other Wikimedia projects will need a similar effort. Another approach for better coverage of languages in Wikipedia is the ArticlePlaceholder [3]. Using Wikidata's inherently multi- and cross-lingual structure, the ArticlePlaceholder displays data in a readable format on Wikipedias, in their language. However, even Wikidata lacks support for many languages, as we were able to show in [4].

The question is therefore: how can we get more multilingual data into Wikidata, using the tools and resources we already have, and eventually how can we reuse Wikidata's data on Wikipedia and other Wikimedia projects in order to support under-resourced language communities and enable them to access information in their language more easily? Accessible content in a language will eventually also mean that speakers are encouraged to contribute to the knowledge. Currently, we are investigating machine learning tools to support the display of data and the gathering of new multilingual labels for information in Wikidata. It can be assumed that over the coming years language accessibility will be one of the key topics for Wikimedia and its projects, and it is therefore important to invest in the topic now and enable an exchange about it.
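
As a small illustration of the ArticlePlaceholder idea, the sketch below fetches labels and descriptions for a Wikidata item in a reader's language through the public wbgetentities API, falling back to English when the language has no label; the item ID and language used are arbitrary examples, and the real ArticlePlaceholder extension is implemented differently:

```typescript
// Fetch the label and description of a Wikidata item in a given language,
// falling back to English when the language has no label yet; this is the
// same gap in coverage that the statement describes.
async function placeholderText(itemId: string, lang: string): Promise<string> {
  const url = 'https://www.wikidata.org/w/api.php'
    + `?action=wbgetentities&ids=${itemId}`
    + `&props=labels|descriptions&languages=${lang}|en&format=json&origin=*`;
  const response = await fetch(url);
  const data = await response.json();
  const entity = data.entities[itemId];

  const label = entity.labels[lang]?.value ?? entity.labels.en?.value ?? itemId;
  const description = entity.descriptions[lang]?.value ?? entity.descriptions.en?.value ?? '';
  return `${label}: ${description}`;
}

// Example: a minimal placeholder for Douglas Adams (Q42) in Welsh.
placeholderText('Q42', 'cy').then(console.log);
```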


[1] Hecht, B., & Gergle, D. (2010). The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 291-300). ACM.
[2] https://en.wikipedia.org/wiki/Wikipedia:Content_translation_tool
[3] https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from_Wikidata_for_Wikipedia_-_Increasing_Access_to_Free_and_Open_Knowledge.pdf
[4] https://eprints.soton.ac.uk/413433/

Ryan Kaldari
Tags: Gadgets, Languages, Structured Data, Wiktionary
Primary Session: Next Steps for Languages and Cross Project Collaboration

How should MediaWiki evolve to support the mission?

One of the greatest barriers to the spread of human knowledge is the barrier of language. While Wikipedia does a great job of supporting hundreds of languages, the amount of content available in most language Wikipedias is still paltry and has a small impact on the knowledge available to speakers of those languages. For a huge percentage of the world's population, the key to unlocking knowledge isn't discovering Wikipedia, but learning new languages. Even for English speakers, the impact of learning a new language can be life-changing and open up many new opportunities.

The Wikimedia Foundation is the steward of one of the greatest repositories of information about language in human history: Wiktionary. Unlike all other dictionaries on Earth, Wiktionary aims to define (in 172 languages) all words from all languages. In other words, not just defining English words in English and French words in French, but also French words in English, English words in French, Latin words in Swahili, Mopan Maya words in Arabic, etc. Its ambitious aim is to be the ultimate Rosetta Stone for the human species.

While Wikipedia is in some respects maturing and gradually yielding diminishing returns for more investment, Wiktionary is still a small and growing project that has yet to fulfill its potential or break into mainstream consciousness the way that Wikipedia has. While one of the impediments to Wiktionary reaching its potential is lack of structured data support, which is being worked on, there are many improvements that could be made in the meantime to improve the usefulness of the site to both readers and editors. These include converting many of the fragile gadgets and site scripts into maintainable extensions, customizing the user interface to more closely match what users expect from a dictionary site, and adding dictionary-specific tools to the editing interface. There is also unexplored potential with building apps around the Wiktionary data, including apps tailored around language learning.

Now that the Wikimedia Foundation has nearly 100 software engineers (and dozens of volunteer developers), it should explore the potential of its lesser known projects, especially Wiktionary, which has the potential to actually make a large impact on the Foundation's mission and bring more of the sum of human knowledge to more people around the globe.

Niklas Laxström
Tags: Machine Translation, Translation
Primary Session: Next Steps for Languages and Cross Project Collaboration

Translation as a way to grow and connect our communities

The Wikimedia movement depends a lot on translation, but I believe we are not currently using its full potential. This affects us in many ways, most importantly:
  • language barriers isolate communities, even though we all need to work together;
  • our content is not accessible to every human;
  • our movement is massively multilingual, yet it is not the forerunner in using translation and other language technology.

We should improve our translation tools and leverage machine translation in a sustainable way. Translation should be a core part of our infrastructure and integrate into our projects seamlessly. It will help our communities to grow, as demonstrated by the Content Translation tool. I suggest three focus areas.

  1. Find partners to build high-quality open-source machine translation

Our projects run on free software. Currently, we depend a lot on proprietary data-driven (statistical) machine translation. If translation is to be an essential part of our infrastructure, this is neither sustainable nor acceptable. We already use expert-driven (rule-based) open-source machine translation software, e.g. Apertium, which provides some high-quality language pairs. However, the proprietary services cover a lot more language pairs, albeit with lower quality. Building machine translation engines is hard work, so we should find partners to pursue both data-driven and expert-driven engines. The impact of this could be big and extend beyond our movement.

  2. Bring translation everywhere

We already have good translation tools, but we need to move beyond the user interface and Wikipedia pages. We should integrate translation tools into our discussion systems to support multilingual discussions as well as to understand discussions in foreign languages. This should be combined with summarizing tools.

We have a lot of (structured) content that can be translated but doesn't have proper tooling for translation, e.g. Wikidata, Commons image descriptions, and labels in SVG files. We should adapt and integrate our existing translation tools to support these types of content.

We should also make language selection available to all users, including those who are not logged in, on our multilingual projects such as Wikidata, so that translations are shown.

  3. Improve our translation tools

Our translation tools have serious issues that result in slower translations, or in content not being translated at all.

Our translation memory is not working well. It often fails to suggest good matches. This is apparent when translating the weekly Tech News. Translators' time is wasted when they need to re-translate (introducing inconsistencies) or search previous translations manually. Without improvement, our translation memory is not suitable for use in Content Translation either.
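
For context on what the translation memory is supposed to do, here is a toy sketch of fuzzy matching new source text against stored translations; the similarity measure and threshold are purely illustrative and not how the actual translation memory in the Translate extension is implemented:

```typescript
interface MemoryEntry {
  source: string;       // previously translated source text
  translation: string;  // its stored translation
}

// Jaccard similarity over lowercased word sets: a crude stand-in for the
// fuzzy matching a real translation memory performs.
function similarity(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/));
  const wordsB = new Set(b.toLowerCase().split(/\s+/));
  const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  return union === 0 ? 0 : intersection / union;
}

// Return stored translations whose source text is close enough to reuse,
// best match first. Too strict a threshold and translators re-translate;
// too loose and they get misleading suggestions.
function suggest(memory: MemoryEntry[], newSource: string, threshold = 0.6): MemoryEntry[] {
  return memory
    .map(entry => ({ entry, score: similarity(entry.source, newSource) }))
    .filter(({ score }) => score >= threshold)
    .sort((a, b) => b.score - a.score)
    .map(({ entry }) => entry);
}
```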

When translating documentation pages, announcements, etc. using the Translate extension, a significant amount of extra markup is added to the wikitext. Editors find this markup inconvenient and justifiably resist using the tool. This feature should be improved so that it works with VisualEditor and doesn't require additional markup in the wikitext.

Lydia Pintscher
Tags: Collaboration, Wikidata
Primary Session: Next Steps for Languages and Cross Project Collaboration

Breaking Down Barriers To Cross-Project Collaboration


"... a world in which every single human being can freely share in the sum of all knowledge."

Wikipedia has been our flagship for many years now and our main means of achieving our vision. As information consumption and the expectations of our readers change, Wikipedia needs to adapt. One crucial building block for this adaptation is re-using and integrating more content from the other Wikimedia projects and other language versions of Wikipedia. Connecting our projects more closely is vital for helping especially our smaller communities serve their readers. Surfacing more content from the other Wikimedia projects also gives them a chance to shine, find their audience, and do their part in sharing in the sum of all knowledge.

This integration comes with a lot of challenges. For many years the Wikipedias have lived largely independently from each other and from the other Wikimedia projects. This is changing. Sharing and benefiting, for example, from data on Wikidata means collaborating with people from potentially very different projects, speaking different languages. It brings a perceived loss of local control. Editors currently see themselves first and foremost as editors of "their" Wikipedia and often don't perceive this integration as worth the effort, especially on the larger projects.

We need to address this on both the social and the technical level if we want to bring our projects closer together and have them benefit from each other's strengths and compensate for each other's weaknesses.

We need to think about and find answers to these questions: What can we do in order to bring our projects together more closely? How can we help break down perceived and real barriers for cross-project work? What can we do to make cross-project collaboration easier?

Santhosh Thottingal
Tags: Languages, Machine Translation, Open Source, Technical Debt, Volunteer Developers
Primary Session: Next Steps for Languages and Cross Project Collaboration

MediaWiki is one of the rare software systems where i18n is done right. This infrastructure needs timely improvement and maintenance. The technology, and the resources to support it, are important, as the 2017 movement strategy states: 'We will build the technical infrastructures that enable us to collect free knowledge in all forms and languages.' But most of this infrastructure is running on volunteer capacity, with no official team responsible.

1. Open-source strategy: MediaWiki language technology is isolated and little known in the general open-source ecosystem. It needs proper ownership, maintenance, and feature enhancements as a good open-source project, so that contributions come to us from other multilingual projects while we help them with our expertise.

   (a) Our localization file formats and the libraries on top of them are very advanced and support more languages than any other system. But it was only when the libraries were made MediaWiki-independent that other projects started noticing them. We use that independent library (jquery.i18n) for VE, ULS, and OOJS-UI, and it is present in MediaWiki core. But it is not actively maintained; issues and pull requests go unaddressed because nobody at the Foundation is currently in charge of it, except in volunteer time. There is a lot of demand for a non-jQuery, general-purpose JS version of the library. The code is aged and tech debt is increasing.
   (b) We developed one of the largest repositories of input methods (100+ languages) to support typing in various languages, along with an input method library, jquery.ime. It is a critical piece of software for many small wikis. The code is aged and is not actively maintained by anybody from the Foundation, except some in their volunteer time, and it has not been updated to take advantage of browser technology improvements around IMEs. This is a MediaWiki-independent open-source library.
   (c) Universal Language Selector, the language selection and switching mechanism for our large list of languages, which also delivers input methods and fonts, needs ownership and tech-debt removal. Navigation between different wikis relies on it, and the team that authored this system no longer exists. This is also a MediaWiki-independent open-source library; VE, Translate, ContentTranslation, and Wikidata depend on it.
   (d) MediaWiki core i18n features (PHP) have also started showing their age. There were plans to make some of them standalone open-source libraries, but that has not happened. Nobody is officially responsible for this infrastructure either.

2. The Translate extension, which helps make the MediaWiki interface available in 300 languages (something we have always been proud of), is not officially maintained by the Foundation now. Localization happens thanks to volunteers, and volunteers maintain the Translate extension code. Moreover, translatewiki.net, which hosts Translate and where the localization by volunteers happens, is also outside Foundation infrastructure.

3. It is time to have machine translation infrastructure within Wikimedia. Content Translation uses machine translation, but it is an isolated product. Translation of content can be useful to readers in various contexts. CX tries to provide a service API for MT, and there is a lot of potential in that. Multiple MT services, even proprietary ones, might be needed to cover all the languages. At the same time, our content and translations are important for training new open-source MT engines.

4. Wikipedia follows a very traditional approach to typography and layout. The Language team had limited webfont delivery aimed at the missing-font issue, but the code is old and has not received any updates in the last 3 years or so. The Language team plans to abandon that feature due to the maintenance burden, but this has not happened and no team now owns it. Apart from this, a few wikis use common.css hacks to customize their default fonts. The typography refresh attempted by the Reading team targeted Latin script. Every script has its own characteristics regarding font size, preferred font-family sequences, line heights, etc. Presenting knowledge in all these language wikis, in 2017 or later, needs serious thought about readability, typography, and the general aesthetic of Wikipedia in a language compared to other websites in that language.

Mingli Yuan
Tags: Machine Translation
Primary Session: Next Steps for Languages and Cross Project Collaboration

Embracing a new era with only small language obstacles

Recent progress on neural machine translation gives us better translation results. The industry invests huge amounts of money in this area for a promising future. For the first time people can communicate with only small language obstacles. We should be prepared for this near future by evaluating our position and understanding the impact. Also we should seek new opportunities, and contribute to the trend.

Advice:
  1. Cooperate with the industry to enhance our translation infrastructure.
  2. Continuously release our translation data as an open corpus.
  3. Evaluate the impact. For example (probably very radical), how about setting up one unified Wikipedia in the future?

Secondary

Niharika Kohli
Tags: Communities, Cross-wiki, Gadgets, Lua, New Users, Templates, Tools
Primary Session: Evolving the MediaWiki Architecture
Secondary Sessions: Next Steps for Languages and Cross Project Collaboration

Investing in our communities

This position statement captures my thoughts about why and how we should be investing in our communities. There are a lot of ways we can encourage and support them that we currently don't. Prioritizing the building of tools for our communities is a crucial step for the long-term survival of our projects.

It's fairly common knowledge that a lot of our communities suffer from toxicity. It's incredibly hard for newcomers to edit, to stick around, and to stay engaged in the midst of the existing toxicity in the community. The problem frequently also exists in smaller communities. Just recently, the English Wikipedia community pushed the WMF into implementing ACTRIAL, preventing brand-new users from being able to create articles on the site. These are signs that all is not well with our communities. If we envision a future with an active, thriving editor community 15 years from now, we have to become more aware of how our communities function and do more to support them than we do today.

The problems also exist on the technical side. Communities without technical resources lose out on gadgets, templates, editing toolbar gadgets, and so on. The editors on these wikis are still forced to do a lot of things the hard way. Non-Wikipedia projects are probably the worst affected. Quite often our software projects also cater only to the bigger projects, often just the Wikipedias. We can't solve everything, but we can try to help solve at least some of these problems. We can invest in better tools for new users to create articles and to edit and experiment with wikitext markup. We can build a better onboarding experience for new users. For example, English Wikipedia currently has the Article Wizard (https://en.wikipedia.org/wiki/Wikipedia:Article_wizard), which is outdated, poorly maintained, and often very confusing. We can think about a more standardized solution that would be useful across wikis. We can also try to showcase user contributions in a better way, to build user engagement. Various wikis have been striving to create and sustain "wikiprojects" for a while, with the result that several big Wikipedias have come up with their own homegrown solutions. These are things the Foundation can help build and standardize for all wikis.

For the technical problems, there is a big backlog of projects which are long overdue. Global cross-wiki watchlists and global gadgets, templates, and Lua modules have been asked for by the community for many years now. There are a lot more such projects to be found on Phabricator and in the wishlist survey. These are projects which can be building blocks in making our communities more sustainable and thriving places. They are big and important enough that they should make it into the product roadmap of teams outside of Community Tech.

Another important thing we should think about is tools. The pageviews analysis tool, for example, is one of the most important volunteer-maintained tools out there. What happens when it stops being maintained? When is a tool important enough for the Foundation to start thinking about incorporating its functionality into an extension or core? These are all important discussions to be had.