Research, Analytics, and Machine Learning

From devsummit
Related Phabricator Task T183320
Topic Leaders Leila Zia, Dan Andreescu, Aaron Halfaker

4 primary statements. 7 secondary statements.

  • Analytics
  • Architecture
  • Artificial Intelligence
  • Big Data
  • Collaboration
  • Contributors
  • Documentation
  • Gadgets
  • Hosting
  • Infrastructure
  • JavaScript
  • Knowledge as a Service
  • Knowledge Equity
  • Languages
  • Machine Learning
  • Mobile
  • Open Source
  • Oral Knowledge
  • Privacy
  • Research
  • Strategy
  • Structured Data
  • Templates
  • Third Parties
  • Tools
  • Translation
  • Trust
  • Wikibase
  • Wikidata


Author Tags Primary Session Secondary Sessions Position Statement
Dan Andreescu Analytics, Big Data, Collaboration, Documentation, Open Source, Research, Third Parties Research, Analytics, and Machine Learning Embracing Open Source Software

Our strategic goals include scaling our communities to a truly global level, and expanding our understanding of human knowledge. To do this, in my opinion, we need to have a much better understanding of our communities' actual work. We have tens of thousands of people doing millions of hours of work every month, and nobody knows exactly what is being done, what the definition of "done" is, and how fast or slow the progress is. We are the leaders of the free knowledge movement, and we are mostly blind except for some big picture notions like pageviews and edits. It is my opinion that we need to develop a good understanding of the work being done on the wikis. Very capable people have already spent lots of time trying to do this, but I believe we have largely failed because of technical limitations. This is a big data and big compute problem, and we have not yet approached it as such. A close collaboration between our communities, Analytics, Research, and Audiences teams is needed, as well as the power of the WMF Hadoop cluster. I have had sessions on this topic already, and am excited to finish planning and transition to actual work. There are some very valuable implications of taking on and finishing this work. Most importantly, we will all be able to talk more objectively about frustrations in the community over changes that cause "more work". For example, when we launched Visual Editor there was huge backlash about the amount of work this change implied for our community. But because this was largely based on subjective opinions, emotions got involved and it took years to calm the negative effect of those emotions. This effort would also give us, for the first time, a way to celebrate these millions of hours of work. People could see, share, and take pride in their part of building human knowledge (if they wanted to; privacy is, of course, one of our top priorities).

I am also interested in expanding our Open Source efforts, and examining changes that we can make to spur more collaboration. My reading of the strategic goals for 2030 is that the WMF will not have enough resources to execute by itself. That's where collaboration will be crucial, and where problems like in-house developed libraries without true Open Source presence will slow us down. We let documentation and third-party user support lag behind because we're busy with other stuff, and that's arguably fine for our scale so far. But this approach will not allow us to grow the way our Strategy is defined.

Erik Bernhardson Analytics, Collaboration, Machine Learning, Open Source, Privacy, Structured Data Research, Analytics, and Machine Learning

Title: Empowering Editors with Machine Learning

Background: Advances in machine learning, powered by open source libraries, are becoming the foundational backbone of technology organizations the world over. Many tedious, time-consuming tasks that previously required 100% human involvement can now be augmented with human-in-the-loop machine learning, empowering editors to get more done with the limited time they have available to contribute to the sum of all human knowledge.

Advice: 1) Invest directly in applying known quantity machine learning, such as pre-trained ImageNet classifiers, to add structured data to our multimedia repositories to increase their discoverability. Perhaps via tools that provide editors with lists of appropriate items that they can easily click to add if appropriate to the multimedia.

2) Engage academia to work with Wikimedia data sets and employ developers to move the most promising results from research into production. There is already a significant amount of work being done in academia to test and evaluate machine learning with our data sets, but little to none of that work ever makes it back into Wikimedia sites. With more focus on collaboration we can encourage research that is specifically applicable to deployment goals.

3) Wikimedia has the ability to collect significant amounts of implicit user data via browsing sessions, searches, watchlists, editing histories, etc. that can be used for machine learning purposes. We need to be continuously mindful of the privacy implications of how we use this data.
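
As a rough illustration of the first recommendation, the sketch below turns classifier output into a short list of suggestions that an editor can accept or reject. The labels, confidence values, thresholds, and the `suggest_tags` helper are all hypothetical; no real classifier or Wikimedia API is called.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    label: str         # class name emitted by a pre-trained classifier
    confidence: float  # probability in [0, 1]


def suggest_tags(predictions, min_confidence=0.6, max_suggestions=3):
    """Return the most confident labels for an editor to confirm or reject."""
    confident = [p for p in predictions if p.confidence >= min_confidence]
    confident.sort(key=lambda p: p.confidence, reverse=True)
    return [p.label for p in confident[:max_suggestions]]


# Hypothetical classifier output for one image.
predictions = [
    Prediction("lighthouse", 0.91),
    Prediction("seashore", 0.72),
    Prediction("sailboat", 0.40),
]
suggestions = suggest_tags(predictions)
```

Keeping the final decision with the editor is the point of the design: the model only narrows the list, and low-confidence labels are never shown.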

Aaron Halfaker Artificial Intelligence Research, Analytics, and Machine Learning

The future of responsible AI design is auditing systems. When we deploy AIs that make inherently subjective judgments that affect people and their work, we must also provide a means for them to audit and critique the AI. Did the AI mark the wrong thing as vandalism? Then it can silence a contribution. Did the AI fail to note a high quality article? Then we might direct traffic away from good content. Did the AI recommend the wrong type of thing? Then we might keep people in a filter bubble rather than helping them broaden their knowledge.

There's a lot of conversation in the public sphere about how AIs cause ethical and social problems. Google's image search suggests that all CEOs are men. Facebook's feed filter reduces the visibility of conflicting opinions. The general call among researchers and ethicists is for transparency. At Wikimedia, transparency is an old idea. We've always developed our technologies transparently. But this hasn't made us immune from the problems that AIs wreak. Auditing systems are the future. They are a means towards giving users power over the AIs that govern our experience. We should be talking about how to build them.
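
To make the idea of an auditing system slightly more concrete, here is a minimal sketch in which people record their own judgment next to the AI's, and the disagreement rate is reported per AI label. The `Judgment` record, the label names, and `audit_summary` are assumptions made for illustration; they do not describe any deployed Wikimedia system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    item_id: str                       # hypothetical edit or article identifier
    ai_label: str                      # what the model decided
    human_label: Optional[str] = None  # filled in when a person audits it


def audit_summary(judgments):
    """Return, per AI label, the share of audited judgments that people disputed."""
    reviewed = Counter()
    disputed = Counter()
    for j in judgments:
        if j.human_label is None:
            continue  # nobody has audited this judgment yet
        reviewed[j.ai_label] += 1
        if j.human_label != j.ai_label:
            disputed[j.ai_label] += 1
    return {label: disputed[label] / reviewed[label] for label in reviewed}


log = [
    Judgment("edit-1", "vandalism", "vandalism"),
    Judgment("edit-2", "vandalism", "good-faith"),
    Judgment("edit-3", "good-faith"),  # not yet audited
    Judgment("edit-4", "good-faith", "good-faith"),
]
rates = audit_summary(log)
```

A high dispute rate for one label would be a signal to re-examine how the model handles that kind of judgment.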

Amir Sarabadani Machine Learning Research, Analytics, and Machine Learning

Machine learning and scaling the knowledge

There are lots of ways to contribute to the Wikimedia movement, and all are tedious, time-consuming, and sometimes frustrating. Take the classic example of fighting vandalism: Wikimedians are frustrated by the flow of vandalism, but at the same time they enjoy fighting it. By using a scalable machine learning platform for wikis, we are moving towards giving more power to our users without tiring them out with the work needed. To come back to our example, ORES filters out most good edits and leaves the rest for the community to take care of (so they enjoy the gratification of doing the job), while also handling the huge backlog of edits needing review. But if we use a bot to revert bad edits, it damages the motivation of the users. We need to continue moving in this direction in other areas such as creating articles, improving quality, categorization, and so on.
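
The triage described above can be sketched as a simple threshold on a model's "damaging" probability. The scores, the threshold, and the `triage_edits` helper are illustrative assumptions and not the ORES API.

```python
def triage_edits(scored_edits, threshold=0.2):
    """Split (revision_id, p_damaging) pairs into cleared edits and a human review queue."""
    cleared, review_queue = [], []
    for rev_id, p_damaging in scored_edits:
        if p_damaging < threshold:
            cleared.append(rev_id)       # very likely a good edit
        else:
            review_queue.append(rev_id)  # left for patrollers, not auto-reverted
    return cleared, review_queue


# Hypothetical scores for four revisions.
scored_edits = [(1001, 0.03), (1002, 0.64), (1003, 0.11), (1004, 0.25)]
cleared, review_queue = triage_edits(scored_edits)
```

Note that nothing is reverted automatically: the model only shrinks the queue, which matches the point about preserving patrollers' motivation.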


Author Tags Primary Session Secondary Sessions Position Statement
Adam Baso Machine Learning, Mobile,, Structured Data, Templates, Wikibase, Wikidata Knowledge as a Service Advancing the Contributor Experience, Research, Analytics, and Machine Learning

Structure Most Things with

The future of digital information will likely be brokered by major platform providers such as Google, Apple, Amazon, Microsoft, and international equivalents and social networks. We're thankful they extend our reach, even as we seek to help consumers on the platforms join our movement.

We could help platform providers, their users, and our users solve problems better through adoption of the open standard into Wikipedia pages mapped with templates and, ideally, federated and synchronized Wikidata properties.


  1. Wikipedia will have even better presentation and placement in search engines and other data rich experiences.
  2. We provide an opportunity for a more consistent data model for template authors and people/bots filling template values. And the richly defined entities provide a good target to reach on all entities represented in the Wikipedia/Wikimedia corpora. Standardization can reduce duplication of effort and inconsistencies.
  3. We introduce an easier vector for mobile contribution, which could include simpler and different data entry, mapping, and modeling.
  4. We can elevate an open standard and push its adoption forward while increasing the movement's standing in the open standards community.
  5. compliant data is more easily amenable to machine learning models that cover data structures, the relations between entities, and the dynamics of sociotechnical systems. This could bolster practical applications like vandalism detection, coverage analysis, and much more.
  6. This might provide a means for the education sector to educate students about knowledge creation, and data modeling, and more. It might also afford scientists and other practitioners a further standardized way to model the knowledge in their fields.

What would it take? And can this be done in harmony with the existing {{Template}} system?

This session will discuss the following:

  1. Are we aligned on the benefits, and which ones?
  2. Implementation options.
    1. Can we extend templates so they could be mapped to
      1. Would it be okay to derive the mapping by manual and automated analysis at WMF/WMDE and apply it behind the scenes? Would that be sustainable?
      2. Could we make it easy for template authors to mark up their templates for compatibility and have some level of enforcement? Could attributes and entity types be autosuggested for template creators?
    2. Is it easy to relate most existing and proposed Wikidata entity types and properties to existing entities and properties?
    3. What would it take to streamline MCR data structures or MCR Wikibase property clusters mapped to on defined entity types?
    4. Furthermore, if we can do #1 and #2, what's to prevent us from letting templates as is merely be the interface for compliant Wikibase entities and properties (e.g., by duck typing / autosynthesis)?
    5. How could we bidirectionally synchronize between Wikipedia and Wikidata with confidence in a way compatible with patroller expectations? And what storage and event processing would be needed? Can the systems be scaled in a way to accommodate arrival of real-time and increasingly fine-grained information?
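
One way to picture the option of letting template authors mark up their templates is a declarative mapping from template parameters to Wikidata property IDs, plus a check that flags parameters with no mapping. The template parameter names and the `unmapped_parameters` helper are hypothetical; the property IDs are existing Wikidata properties used purely as examples.

```python
# Hypothetical mapping for a person infobox: template parameter -> Wikidata property ID.
PERSON_INFOBOX_MAPPING = {
    "birth_date": "P569",   # date of birth
    "birth_place": "P19",   # place of birth
    "occupation": "P106",   # occupation
}


def unmapped_parameters(template_params, mapping):
    """Return the template parameters that have no property mapping yet."""
    return sorted(set(template_params) - set(mapping))


template_params = ["birth_date", "birth_place", "occupation", "nickname"]
missing = unmapped_parameters(template_params, PERSON_INFOBOX_MAPPING)
```

A check like this could back the "some level of enforcement" question: a template with unmapped parameters could be flagged for its author rather than rejected.
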
Jan Dittrich Documentation, Gadgets, Infrastructure, JavaScript, Open Source, Research Evolving the MediaWiki Architecture Research, Analytics, and Machine Learning

I believe that we need to achieve a better separation of concerns, in code as well as in our product work and our communication with the communities, to reduce the debt we build up in these areas. Therefore, I want to suggest three interrelated topics:

  • Use of modern MVVM frameworks for our front end code, to develop more efficiently
  • Provision of a modern customization infrastructure, to decouple gadgets from our code
  • Participation beyond code and feature wishes
  1. Use of modern MVVM frameworks for our front end code

Traditionally, MediaWiki has been focused on PHP. Over the last few years, more and more interactivity has been added via JavaScript. Combined with the significant growth of the JavaScript ecosystem, this could have meant quicker development and a clearer separation of concerns. However, since our solutions are mostly used in a MediaWiki context, involving external developers has been difficult, and so has onboarding developers in core teams.

I see a large opportunity in introducing modern MVVM libraries that are open source and not constrained to use in Wikimedia software. We could build upon others' experiences as well as their documentation, things that have traditionally been problematic in our isolated MediaWiki solutions.

Strong contenders are React and Vue.js. While the React ecosystem is larger, I would recommend Vue for its better documentation and clear compartmental structure, which hopefully helps us to avoid further isolated solutions.

  2. Provision of a modern customization infrastructure

The introduction and wider use of an MVVM framework could also be a chance to provide clear frontend APIs for gadgets. They currently use DOM hacks, which break continuously and would not be possible anyway with a modern frontend framework (due to DOM flushing).

Why should we bother? Because we have a large user base in which different tasks are shared using specific tools, just as each manual craft has many different specialized, often even customized, tools.

Additionally, gadgets/userscripts could provide a low-barrier opportunity to onboard new developers. Other organizations successfully show that user provided extensions can enhance an ecosystem with user driven innovation and help with onboarding developers, e.g. Firefox' and Chrome's WebExtensions as well as LibreOffice.

I would like to work on finding a way to fulfil the possibilities of gadgets and extend them, while providing a sustainable and secure infrastructure for doing so.

  3. Participation beyond code and feature wishes

We already do extensive user research. A large area for expansion and further development is doing this research and sense making *with* the community. This may already be done, often implicitly, based on feature- or UI-focused requests from community members. But this has large caveats: the solution may not be feasible or sustainable to implement. Furthermore, without understanding the underlying need, we risk building up technical and UX debt and giving away the possibility of learning from our community.

To achieve an active, needs-based involvement of communities in design and research, we could build on existing participatory design methods. They could be used and integrated in our research and product planning frameworks. Clearly integrating the community in up-front research could enable us to gather needed knowledge, have community participation, and reach a better understanding between the Wikimedia Foundation and communities, as well as among the communities themselves. I want to define future participatory design strategies to be used on our way towards 2030.

James Forrester Research, Strategy, Tools Evolving the MediaWiki Architecture Research, Analytics, and Machine Learning

Fundamentally, Wikimedia's technologies are tools to achieve our mission - absolutely vital tools, but not objectives in themselves. Where a tool has dulled we should sharpen it, where it has rusted we should polish it, and where it has blunted we should replace it.

The majority of our tools have sprouted over time in response to immediate needs, and grown ad hoc when we've spotted something they can also do, or been pruned back when they proved too unwieldy to retain. Our communities have taken these tools and built amazing things with them, often despite rather than in line with their intended use. Subsequently these unplanned use patterns have shaped what we think about the tools and how they should be used, when we do so.

This haphazard, tactical development has worked well enough, but has limited us in several ways. We often fail to serve some of our audience because we rush in with a quick fix that listens to a few voices and decides that that's the best thing to build. When we've tried to build more systemic change, it's often been unrooted in serious evidence, and so is like constructing ivory towers into the clouds: baffling, hopeless, and unfamiliar.

We should develop comprehensive methods to collect and monitor actionable data on how well our tools are serving their purposes, and where we can improve. This should come from all stakeholders, covering our great, already-empowered, experienced editors in major languages but also those from whom we rarely hear - those contributing in and speaking smaller languages or not interacting with other users on meta-editing issues, and those with a looser relationship to the movement like readers and casual editors.

We should have numbers clearly attached to our tools as to how we expect them to perform. How these are obtained will differ. Sometimes quick numbers like success rates of false positives against false negatives from anti-abuse features, or how many users having made changes try to press the submit button, will work. Sometimes simple surveys with expected happiness thresholds will be appropriate. In others we may need to work harder to come up with the right way to understand how different tools and experiences interact with each other, like how much "knowledge" readers successfully glean from the article, or whether the burden of allowing logged-out editing is worth the mindshare of "anyone can edit" feeling true.

Ideally, changes to user features and especially introductions of new features should progressively roll out based on these numbers - and if they have adverse effects, they should be automatically rolled back. This is how others operate, but it's very distant from today. It's a far-off dream now, but I believe we can build it.
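
A metric-gated rollout of this kind could look roughly like the sketch below: exposure advances one stage at a time while a success metric stays above an agreed threshold, and drops to zero otherwise. The stage sizes, the threshold, and the `next_exposure` function are assumptions chosen for illustration, not an existing deployment mechanism.

```python
ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]  # share of users exposed at each stage


def next_exposure(current, success_rate, min_success_rate=0.95):
    """Advance one stage if the metric holds; otherwise roll back to zero exposure."""
    if success_rate < min_success_rate:
        return 0.0  # automatic rollback
    later = [stage for stage in ROLLOUT_STAGES if stage > current]
    return later[0] if later else current  # stay at full rollout once reached


exposure_after_good_week = next_exposure(0.01, success_rate=0.97)
exposure_after_bad_week = next_exposure(0.10, success_rate=0.90)
```

The same gate works with any of the numbers mentioned above (submit success rates, survey thresholds), as long as the expected value is agreed before the rollout starts.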

Lucie-Aimée Kaffee Languages, Machine Learning, Translation, Wikidata Next Steps for Languages and Cross Project Collaboration Research, Analytics, and Machine Learning

Languages in the world of Wikimedia

One of the central topics of Wikimedia's world is languages. Currently, we cover around 290 languages in most projects, with varying degrees of coverage. In theory, all information in Wikipedia can be replicated and connected, so that different cultures' knowledge is interlinked and accessible no matter which language you speak. In reality, however, this can be tricky. The authors of [1] show that even English Wikipedia's content is in large part not represented in other languages, even in other big Wikipedias. And the other way around: the content in underserved languages is often not covered in English Wikipedia. A possible solution is translation by the community, as done with the content translation tool [2]. Nevertheless, that means translating all language articles into all other languages, which is a never-ending effort and, especially for small language communities, barely feasible. And it is not only about Wikipedia: the other Wikimedia projects will need a similar effort! Another approach to better coverage of languages in Wikipedia is the ArticlePlaceholder [3]. Using Wikidata's inherently multi- and cross-lingual structure, AP displays data in a readable format on Wikipedias, in their language. However, even Wikipedia has a lack of support for languages, as we were able to show in [4]. The question is therefore how we can get more multilingual data into Wikidata, using the tools and resources we already have, and eventually how to reuse Wikidata's data on Wikipedia and other Wikimedia projects in order to support under-resourced language communities and enable them to access information in their language more easily. Accessible content in a language will eventually also mean they are encouraged to contribute to the knowledge. Currently, we are investigating machine learning tools to support the display of data and the gathering of new multilingual labels for information in Wikidata.
It can be assumed that, over the coming years, language accessibility will be one of the key topics for Wikimedia and its projects. It is therefore important to invest in the topic now and enable an exchange about it.

[1] Hecht, B., & Gergle, D. (2010, April). The tower of Babel meets web 2.0: user-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 291-300). ACM. [2] [3] [4]

Daniel Kinzler Architecture, Collaboration, Hosting, JavaScript, Mobile, Strategy Evolving the MediaWiki Architecture Research, Analytics, and Machine Learning

I would like to discuss how assumptions drive our day to day work, and how to make sure we properly understand and regularly challenge these assumptions. I'm particularly interested in how technological assumptions shape product decisions, and how product assumptions shape technological decisions. Three major axioms come to mind:

MediaWiki needs to run in a shared hosting environment. This has been an explicit requirement for a long time now, but the baseline product that actually does run in such an environment (LAMP with no root access) is becoming more and more sub-par. We are already struggling to provide a decent mobile browsing experience there, not to mention search or WYSIWYG editing. So we should have a discussion about how long we want to keep this requirement, what the consequences of dropping it would be, and what alternative platform we should target for the baseline installation of MediaWiki.

Editing has to work with old browsers and without JavaScript. It has long been an explicit requirement that no basic functionality, particularly editing, can require JavaScript to be enabled. However, this causes us to fall behind other sites further and further. With more and more sites requiring JS, it's becoming less and less clear to me that this requirement is still sensible. This is especially true in the light of many developing countries skipping straight from mostly-offline to mobile-only.

The primary medium for knowledge sharing is text. This assumption used to be hard-coded into MediaWiki until the introduction of ContentHandler, and it still seems to be hard coded in the minds of many long term contributors, to the software and to the wikis. I believe that it is high time to invest into exploring other media formats and alternative forms of collaboration. It seems to me like "Beyond Wikitext" is the major technological challenge that has come out of the movement strategy process, and that we should start thinking and talking about it - from the technological side as well as the product side.

Keerthana S Contributors, Machine Learning Advancing the Contributor Experience Research, Analytics, and Machine Learning

Breaking the ice and catering to the could-be student Wikipedia contributors

Most of the valuable contributions, especially to technically advanced articles, come from people in academia. So my paper is going to discuss why it can be valuable to expose university students to contributing to Wikipedia and give them enough guidance to stick around, the existing infrastructure that helps this cause, and some points on how it can improve. As the infrastructure of MediaWiki evolves and becomes a platform where beginners to Open Source find it easy to contribute to the project, with a really well documented code base, a friendly community, and many outreach programs, we should also think about introducing university students to contributing to Wikipedia.

Wikipedia serves as an invaluable tool for students worldwide, helping them to assimilate their course content. They write term papers as part of their course work, so it only makes sense that making students aware of contributing to Wikipedia, and giving them guidance, can be a source of reliable and high quality contributions to Wikipedia.

Existing Infrastructure

WikiEduDashboard, a project of the WikiEducation Foundation, caters to universities where students are required to contribute to Wikipedia articles as part of their course assessment, and provides tools for instructors to guide the students in it.

Machine Learning tools for Guidance

There are existing automatic mechanisms in Wikipedia to detect plagiarism, promotional content, or any form of spam in edits. Automatically rating Wikipedia articles has, to an extent, been achieved by the Scoring Platform Team. This is being utilised by many bots in Wikipedia to spot potential vandalism. This score prediction tool can also be used to give immediate feedback to newbie editors in a more friendly manner and point out the faux pas in their edits.
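
As a minimal sketch of such friendly feedback, the function below maps two model probabilities to a short message for a new editor. The probability names, cut-offs, message texts, and `feedback_for` itself are assumptions for illustration; they are not output of the Scoring Platform Team's tools.

```python
def feedback_for(p_damaging, p_goodfaith):
    """Pick a friendly message from two model probabilities in [0, 1]."""
    if p_damaging < 0.3:
        return "Thanks, your edit looks good!"
    if p_goodfaith >= 0.5:
        # Probably well-intentioned but flawed: guide rather than warn.
        return "Thanks for contributing! This edit may need a source or a small fix."
    return "This edit has been queued for review by an experienced editor."


message = feedback_for(p_damaging=0.45, p_goodfaith=0.8)
```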

Leila Zia Infrastructure, Knowledge as a Service, Knowledge Equity, Languages, Oral Knowledge, Research, Strategy, Trust Knowledge as a Service Research, Analytics, and Machine Learning

Title: Knowledge is our direction. What's next?

Combined knowledge as a service (KAS) and knowledge equity (KE) is identified as our strategic direction (draft). We have decided to focus on knowledge in a broader sense and beyond just encyclopedic knowledge, create KE, and become the infrastructure that offers KAS. In this position paper, I offer some of my early thoughts on where we should focus our efforts to move in this strategic direction. Given the limits of word-count, I will not go through the details of research methods and techniques that can be used to address each point.


As the central focus of the strategic direction is knowledge, we need to arrive at a unified working definition of knowledge. English Wikipedia defines knowledge as familiarity, awareness, or understanding of someone or something which is acquired through experience or education, by perceiving, discovering, or learning. This definition, however, is not a working definition that can help us decide what new content to include.

Research on user behavior, needs, and learning patterns can help us define knowledge.

Knowledge equity

Our goal is to remove structural inequalities that limit our ability to represent knowledge from all people and by all people. To this end, we need to meet our users where they are. Today:

  • language is a barrier to sharing in knowledge. Content should be available to our users in their languages.
  • text-only knowledge is a blocker for gathering knowledge, especially from parts of the world that are already left behind. Our systems should become technologically receptive to accepting and allowing editability of new forms of knowledge (e.g., voice for oral knowledge).
  • limits in proficiency and literacy are a blocker for our users. The content and its presentation will need to become a function of these parameters.

Knowledge as a service

Our goal is to offer KAS: both in terms of the infrastructure that supports it as well as the content of it. To do this, we need to:

  • empower our users to learn, create, and go beyond consuming content: Wikimedia projects' talk and discussion pages are an asset for building systems that can help our users think critically and learn how to deliberate. We need to do research to surface this critical thinking and step by step deliberation to gain insights from it, and share it with others as part of our KAS effort.
  • do research and development on building systems where deliberation and decision making can be possible at scale. Today, there is no such system available but one of the building blocks of KAS is infrastructure for discussion, deliberation, and decision making.
  • empower our users with ways to assess the trustworthiness of the content. Trust and reputation become especially important as we move to new forms of knowledge such as oral knowledge. We should do research to build trust and reputation models for Wikimedia and its users and understand how to surface such metrics as measures of reliability of the knowledge we serve.