Infrastructure

From devsummit

7 statements.

Author Tags Primary Session Secondary Sessions Position Statement
C. Scott Ananian Censorship, Infrastructure, Languages, Machine Learning, Machine Translation, Translation, User Experience Next Steps for Languages and Cross Project Collaboration Advancing the Contributor Experience

'One World, One Wiki!' Instead of today's many siloed wikis, separated by language and project, our goal should be to re-establish a unified community of collaborators. We will still respect language and cultural differences - there will still be English, German, Hebrew, Arabic, etc. Wikipedias; they will disagree at times - but instead of separate domains, we'll embrace a single user experience with integrated navigation between projects and languages and the possibility of split screen views aligning related content. On a single page we can work on articles in different languages, or simultaneously edit textbook content and encyclopedia articles. Via machine translation we can facilitate conversations and collaborations spanning languages and projects, without forcing a single culture or perspective.

Machine translation plays a key role in removing these barriers and enabling new content and collaborators. We should invest in our own engineers and infrastructure supporting machine translation, especially between minority languages and script variants. Our editing community will continually improve our training data and translation engines, both by explicitly authoring parallel texts (as with the Content Translation Tool) and by micro-contributions such as clicking yes/no on a proposed translation or pair of parallel texts ('bandit learning'). Using 'zero-shot translation' models, our training data from 'big' wikis can improve the translation of 'small' wikis. Every contribution further improves the ability of our tools to make additional articles from other languages available.

A translation suggestion tool will suggest an edit in one language whenever an edit is made to a parallel text in another language. The correspondences can be manually created (for example, via the Content Translation Tool), but our translation engine can also automatically search for and score potential new correspondences, or prune old entries when the translation has drifted. Again, each new correspondence trains the engine and improves its ability to suggest further correspondences and edits.

Red-links and stubs are replaced with article text from one of the user's preferred fallback languages, perhaps split-screened with a machine translation into the user's primary language. This will keep 'small' language wikis sticky, and prevent readers from getting into the habit of searching in a 'big' language first.

We should build clusters specifically for training translation (and other) deep learning models. As a supplement to our relationships with statistical translation tools Moses and Apertium, we should partner with the OpenNMT project for modern neural machine translation research. We should investigate whether machine translation can replace LanguageConverter, our script conversion tool; conversely, our editing fluency in ANY language pair should approach what LanguageConverter provides for its supported languages.

By embracing unity between projects and erasing barriers between languages, we encourage the flow of diverse content from minority languages around the world into all of our wikis, as well as improving the availability of all of our content into indigenous languages. Language tools route around cultural or governmental censorship: by putting parallel texts and translations in the forefront of our UX we expose our differences and challenge preconceptions, learning from each other.

Jan Dittrich Documentation, Gadgets, Infrastructure, JavaScript, Open Source, Research Evolving the MediaWiki Architecture Research, Analytics, and Machine Learning

I believe that we need to achieve a better separation of concerns - in code as well as in work on product and our communication with the communities to reduce the dept we build up in these areas. Therefore, I want to suggest three interrelated topics:

  • Use of modern MVVM frameworks for our front end code, to develop more efficiently
  • Provision of a modern customization infrastructure, to decouple gadgets from our code
  • Participation beyond code and feature wishes
  1. Use of modern MVVM frameworks for our front end code

Traditionally, Mediawiki has been focused on PHP. Over the last years, more and more interactivity via JavaScript has been used. In connection with the significant growth of the JavaScript ecosystem this could have meant quicker development and a clearer separation of concerns. However, since our solutions are mostly used in MediaWiki context, involving external developers has been difficult and so has been onboarding developers in core teams.

I see a large opportunity in introducing modern MVVM libraries that are open source and not constrained to use in Wikimedia software and could build upon other's experiences as well as documentation - things that have been traditionally problematic in our isolated MediaWiki solutions.

Strong contenders are React and vue.js. While the react ecosystem is larger, I would recommend vue for its better documentation and clear compartmental structure which hopefully helps us to avoid further isolated solutions.


  1. Provision of a modern customization infrastructure

The introduction and larger use of a MVVM could also be a chance to provide clear frontend APIs for Gadgets. They currently use DOM-hacks, which break continuously and would not anyway not possible when using a modern frontend framework (due to DOM flushing).

Why should bother, since we have a large user base in which different tasks are shared using specific tools, just like each manual work has many different specialized, often even customized tools.

Additionally, gadgets/userscripts could provide a low-barrier opportunity to onboard new developers. Other organizations successfully show that user provided extensions can enhance an ecosystem with user driven innovation and help with onboarding developers, e.g. Firefox' and Chrome's WebExtensions as well as LibreOffice.

I would like to work on finding a way fulfil the possibilites of gadgets and extend them while providing sustainable and secure infrastructure for doing so.


  1. Participation beyond code and feature wishes

We already do extensive user research. A large area for expansion and further development is doing this research and sense making *with* the community. This may already be done, often implicitly, based on feature- or UI focused requests of community members. But this has large caveats: The solution may net be feasible or sustainable to implement. Furthermore, without understanding the underlying need, we risk building technical- and UX debt and give away the possibility of learning from our community.

To achieve an active, needs-based involvement of communities in design and research we could build on existing participatory design methods. They could be used and integrated in our research and product planning frameworks. Clearly integrating community in up front research could enable us to gather needed knowledge, have community participation and reach a better understanding between Wikimedia Foundation and communities as well as of the communities among each other. I want to define future participatory design strategies to be used on our way towards 2030.

Eric Evans Architecture, Infrastructure, Strategy Evolving the MediaWiki Architecture

More than just servers

A modern approach to tooling and infrastructure is needed, not only so that we may scale future content and users, but the deployment of new technologies and services as well.

Our way of approaching infrastructure is in need of modernization; What we do we do well, but if we are to grow our capacity, not just for storing and serving content, but our capacity to create the technologies that empower movement strategies, we need a departure from the idea of infrastructre as merely clusters of servers. We need high-level, easily consumed platforms for computation, storage, deployment, and management. We need environments that make experiments cheap, and allow teams to fail-fast or iterate on the next stage quickly. We need systems that are distributed by nature, self-service, secure, and that are able to provide insight into availability and performance.

Some efforts have been (or are being) made here. Examples include recent work toward a Kubernetes deployment, a change-propagation service, and RESTBase. These efforts, while worthwhile, are piecemeal, and no holistic strategy exists. It is my belief that as we discuss the future of our platform, we should consider the requirements in the context of the bigger picture, and discuss a strategy aimed at modernizing our infrastructure.

Giuseppe Lavagetto Analytics, Data Center, Infrastructure, Job Queue, Multi-Datacenter, Performance Evolving the MediaWiki Architecture

The future of the MediaWiki infrastructure at the wikimedia foundation.

MediaWiki is at the core of the infrastructure that serves all of the Wikimedia projects, and the current setup of MediaWiki in production poses various challenges: from the future of our current runtime (HHVM), to the to ability to serve MediaWiki from multiple DataCenters, to long standing issues as resource usage efficiency and flexibility.

Here are some of the things we have to tackle in the future.

Transition off of HHVM:

Since the HHVM team has made it clear they're parting ways with full PHP compatibility, and that maintaining support for both HHVM and PHP in MediaWiki would be arduous, we need to make plans to move off of HHVM, back to PHP 7.x. This transition, while technically necessary, should not come at a cost for our users: page load times should not degrade. We can proceed by marking responses coming from either engine, collecting metrics and analyzing data. In order to achieve this, we should run the two runtimes in parallel on the same servers (which have plenty of capacity, given no MediaWiki cluster has an utilization over 40%), and we will then be able to programmatically route individual users or a percentage of traffic, or even specific wikis, to one or the other. The deadline for this transition is set for the end of 2018 (EOL of the last compatible version of HHVM), and planning and resources should be allocated to this goal.

Multi-Datacenter support:

We currently are using our datacenters in a active/passive setting, as far as MediaWiki is concerned. While this is ok in line of principle, this is a huge waste of resources and means we both have 50% of our servers doing nothing at all, and also limits our ability to expand the number of core datacenters we can use. Diverting the read load to secondary datacenters could also allow us to use caching in a less aggressive way when not needed. There is already a program underway to add first-class multi-DC support to MediaWiki, so we can focus on what specifically needs to be done in order to achieve this longtime goal: our final goal should be to serve reader's traffic from all datacenters, and to be able to switch the "master" datacenter in matter of minutes.

Elasticity, resource usage efficiency

At the moment , our infrastructure is plainly inadequate to react to sudden spikes of non-wiki content production and to changes that generate a lot of asynchrounous jobs, as a change of a popular template. The issues with the current jobqueue are widely known and publicized, but even the current transition to a new model won't solve the starvation of resources that result in a degraded user experience. Moreover, a single editor uploading videos via video2commons can easily overflow our media processing capacity for weeks. This happens because we allocate our resource statically (we have 4 vidoesclaers per datcenter, for example), we have an inefficient resource consumption, and reallocating servers requires time and effort. Modern applications stacks are elastic, meaning the operation of scaling up or down the capacity of a single cluster or functionality can be handled programmatically and/or manually whenever the need occurs, allowing the infrastructure to react to such changes. For economic and privacy/security reasons, Wikimedia doesn't make use of external cloud services, so the only way to achieve such flexibility is to build a serviceable infrastructure that can serve MediaWiki and any other project Wikimedia will support: the effort to do that is underway with the rollout of our Kubernetes-based IaaS in production. I think we should work, sooner than later, at moving the MediaWiki application stack (and maybe its semi-ephemeral caching) to the kubernetes platform. While the advantages of such an approach seem clear, it won't come without costs: specifically habits around code deployment, testing and configuration changes will need to be completely revisited and superseded by new approaches.

Timo Tijhof Infrastructure, Open Source, User Experience Embracing Open Source Software Evolving the MediaWiki Architecture
  • Embrace open-source and keep our software to the same standards we hold other open-source software. This would prevent our software from becoming isolated, hard to maintain, or hard to contribute to.
  • Scale the contributor experience. Ensure our content remains of high quality and value to readers; ultimately to avoid failing our mission. I envision this requires a radical shift in how our application is served, by involving a non-static service capable of scaling to the traffic of our CDN and yet vary responses by user.

Open-source:

We must understand the dangers of producing software that isn't reusable. Such software may be hard to maintain, hard to contribute, both for future contributions, and our future selves.

"Current needs" only exist to serve our long-term needs. Losing track of long-term needs can make software too specific to a current need, risking a trend of releasing software that is only open-source as a courtesy, for transparency, and without being re-usable.

Reusable software has a defined purpose and serves it well. It tends be easy to install, well-documented, and easy to contribute to. Re-use between different services, as well as externally. Such as for community tools, cloud services, or other third parties. Having a defined goal also encourages designing APIs in a way that we can agree not to break or change too often, because they are public.

Scaling:

Our current infrastructure is highly optimised for the passive reader that doesn't contribute. We serve a static CDN response to most users. For users having logged-in, or made contributions, we bypass these layers for all page loads. As a result, their document load time increases by 5x-10x (eg. NavTiming metric responseStart).

In 2015, Ori mentioned the danger of this in (<https://blog.wikimedia.org/2014/12/29/how-we-made-editing-wikipedia-twice-as-fast/>), saying optimising our backend will "allow us to dissolve the invisible distinction between passive and active users". And "enable microcontribution [features] that draw [in] passive readers".

Banners (CentralNotice) are a good example of our needs being at odds with our infrastructure. We want banners to show as part of the page, and for banners to vary by user, location, plus random variance. Our current infrastructure could only do so by bypassing the CDN on all requests. As such, the current way is entirely client-side and completes well after page load.

In few cases where we do ask readers for data, it is for statistical purposes or to improve the software. Direct (or indirect) contributions to our content remains limited to complex actions like "edit". Moving our contributors experience to match some of the capabilities and performance of the reader experience, would enable us to start accepting micro contributions that actually produce a change in content (either directly, or e.g. by consensus). It also opens the door to making our web platform work offline (e.g. ServiceWorkers) which further enables high-performant interactions that can be uploaded at a later time.

Gergő Tisza Collaboration, Hosting, Incubator, Infrastructure, Innovation Supporting Third-Party Use of MediaWiki

We need a wikiproject incubator to enable innovation

We need infrastructure for cheap creation of new wikis and experimentation with new project types and their different technical and social needs.

Wikipedia was an unlikely bet that paid off extremely well. It does not cover the entirety of human knowledge however - it's limited to long-form, text-based representations of encyclopedic topics on subjects well covered by reliable sources. There have been recurring discussions in the past about covering new types of knowledge (genealogy, fact checking, 3D models, oral history, collaborative video etc. etc.) but they never went anywhere, because creating new projects is a lot of work and the outcome is unpredictable (some things only work in practice), so the WMF was unwilling to take the risk. If we want to make meaningful progress on our vision of curating the sum of *all* knowledge, and truly become the infrastructure of free knowledge, we need to overcome this barrier.

We need infrastructure for low-effort, low-risk experimentation with new projects - a "wiki nursery". A system where new wikis can be created at minimal cost, technical and social experiments can be performed on them without undue risk to other projects, and they can be discarded if they prove unsuccessful. Other organizations have used such mechanisms with great success (Wikia has tens of thousands of wikis; Stack Exchange has nearly two hundred sites).

Such a system has to be somewhat separate from the existing wiki cluster; a closely integrated approach like incubator.wikimedia.org is both too inflexible and too insecure. It needs to have some level of operational and legal maturity (more than what Cloud VPS offers). For risk segregation and eficiency, it would have to be different from our production cluster in a number of ways:

  • single sign-on which does not rely on the local wiki for authentication
  • no shared database and configuration acces
  • ensure even distribution of server resources without constant human supervision
  • flat domain structure (to avoid cookie leaks) with affordable certificate management
  • new management interfaces and corresponding core support to avoid spending developer time on common tasks (wiki creation and destruction, configuration and extension management)

While there is a significant one-time cost to setting up such a system, keeping it running is cheap, and the benefits are well worth it - information on what content types and what policies enable productive collaboration, a space free of the conservativism and change aversion of large established projects where new concepts can prove themselves, an entry point for projects with a less western-centric interpretation of knowledge, lots of small projects which are welcoming to new editors and provide lots of low-hanging fruit for contribution. (Also, the same feature set is needed for paid wiki hosting service, which is one way to create non-WMF income streams into the MediaWiki software ecosystem - a valid strategic goal on its own.)

Leila Zia Infrastructure, Knowledge as a Service, Knowledge Equity, Languages, Oral Knowledge, Research, Strategy, Trust Knowledge as a Service Research, Analytics, and Machine Learning

Title: Knowledge is our direction. What's next?

Combined knowledge as a service (KAS) and knowledge equity (KE) is identified as our strategic direction (draft). We have decided to focus on knowledge in a broader sense and beyond just encyclopedic knowledge, create KE, and become the infrastructure that offers KAS. In this position paper, I offer some of my early thoughts on where we should focus our efforts to move in this strategic direction. Given the limits of word-count, I will not go through the details of research methods and techniques that can be used to address each point.

Knowledge

As the central focus of the strategic direction is knowledge, we need to arrive at a unified working definition of knowledge. English Wikipedia defines knowledge as familiarity, awareness, or understanding of someone or something which is acquired through experience or education, by perceiving, discovering, or learning. This definition, however, is not a working definition that can help us decide what new content to include.

Research on user behavior, needs, and learning patterns can help us define knowledge.

Knowledge equity

Our goal is to remove structural inequalities that limit our ability to represent knowledge from all people and by all people. To this end, we need to meet our users where they are. Today:

  • language is a barrier to sharing in knowledge. Content should be available to our users in their languages.
  • text-only knowledge is a blocker for gathering knowledge, especially from parts of the world that are already left behind. Our systems should become technologically receptive to accepting and allowing editability of new forms of knowledge (e.g., voice for oral knowledge).
  • limits in proficiency and literacy is a blocker for our users. The content and its presentation will need to become a function of these parameters.

Knowledge as a service

Our goal is to offer KAS: both in terms of the infrastructure that supports it as well as the content of it. To do this, we need to:

  • empower our users to learn, create, and go beyond consuming content: Wikimedia projects' talk and discussion pages are an asset for building systems that can help our users think critically and learn how to deliberate. We need to do research to surface this critical thinking and step by step deliberation to gain insights from it, and share it with others as part of our KAS effort.
  • do research and development on building systems where deliberation and decision making can be possible at scale. Today, there is no such system available but one of the building blocks of KAS is infrastructure for discussion, deliberation, and decision making.
  • empower our users with ways to assess the trustworthiness of the content. Trust and reputation become especially important as we move to new forms of knowledge such as oral knowledge. We should do research to build trust and reputation models for Wikimedia and its users and understand how to surface such metrics as measures of reliability of the knowledge we serve.