Censorship

From devsummit

3 statements.

Author Tags Primary Session Secondary Sessions Position Statement
C. Scott Ananian Censorship, Infrastructure, Languages, Machine Learning, Machine Translation, Translation, User Experience Next Steps for Languages and Cross Project Collaboration Advancing the Contributor Experience

'One World, One Wiki!' Instead of today's many siloed wikis, separated by language and project, our goal should be to re-establish a unified community of collaborators. We will still respect language and cultural differences - there will still be English, German, Hebrew, Arabic, etc. Wikipedias; they will disagree at times - but instead of separate domains, we'll embrace a single user experience with integrated navigation between projects and languages and the possibility of split screen views aligning related content. On a single page we can work on articles in different languages, or simultaneously edit textbook content and encyclopedia articles. Via machine translation we can facilitate conversations and collaborations spanning languages and projects, without forcing a single culture or perspective.

Machine translation plays a key role in removing these barriers and enabling new content and collaborators. We should invest in our own engineers and infrastructure supporting machine translation, especially between minority languages and script variants. Our editing community will continually improve our training data and translation engines, both by explicitly authoring parallel texts (as with the Content Translation Tool) and by micro-contributions such as clicking yes/no on a proposed translation or pair of parallel texts ('bandit learning'). Using 'zero-shot translation' models, our training data from 'big' wikis can improve the translation of 'small' wikis. Every contribution further improves the ability of our tools to make additional articles from other languages available.

A translation suggestion tool will suggest an edit in one language whenever an edit is made to a parallel text in another language. The correspondences can be manually created (for example, via the Content Translation Tool), but our translation engine can also automatically search for and score potential new correspondences, or prune old entries when the translation has drifted. Again, each new correspondence trains the engine and improves its ability to suggest further correspondences and edits.

Red-links and stubs are replaced with article text from one of the user's preferred fallback languages, perhaps split-screened with a machine translation into the user's primary language. This will keep 'small' language wikis sticky, and prevent readers from getting into the habit of searching in a 'big' language first.

We should build clusters specifically for training translation (and other) deep learning models. As a supplement to our relationships with statistical translation tools Moses and Apertium, we should partner with the OpenNMT project for modern neural machine translation research. We should investigate whether machine translation can replace LanguageConverter, our script conversion tool; conversely, our editing fluency in ANY language pair should approach what LanguageConverter provides for its supported languages.

By embracing unity between projects and erasing barriers between languages, we encourage the flow of diverse content from minority languages around the world into all of our wikis, as well as improving the availability of all of our content into indigenous languages. Language tools route around cultural or governmental censorship: by putting parallel texts and translations in the forefront of our UX we expose our differences and challenge preconceptions, learning from each other.

Ariel Glenn Censorship Advancing the Contributor Experience

Think like a Pirate: How to beat Internet censorship

Universal access to a digital good such as the knowledge curated and made available via Wikimedia projects, presupposes access without censorship.

Censorship and circumvention methods become more advanced over time. Censorship ranges from blocks of single articles to targeting DNS providers to seizing servers to shutting off Internet access completely. Some of these methods are in use right now against Wikimedia projects.

One form of censorship evasion has proven virtually impossible to stamp out: piracy of copyrighted content, in particular music and movies. Let's look at the methods used by the pirates and adapt them for use by Wikimedia content providers and users. We would like our content to be widely shared, available everywhere. Here is what we need to get started:

1. Content must be downloadable and usable off-line.

  * Content meant to be used online, that requires contact with an external server, fails this test. Movies and music do not.

2. Content must be partitionable.

  * You don't grab all alternative music for 2017, but just the albums from the artists you want.  Users will likely not need or want all of the English language Wikipedia (for example) but only subsets.

3. Content must be usable off-line by applications everyone has.

  * Movies and music are downloaded in formats that play in apps that come standard with every OS on every platform.  Usability must include navigation and search of content.

4. Downloadable content must be easy to find, both before and after censorship.

  * You ask Google to find the music or movie you want on YouTube or elsewhere, click and download. Failing that, there is a fallback (see below).

5. Tech-savvy downloaders must be able to seed the distribution of content to everyone else.

  * For music or movies, folks who download from private torrent trackers make copies to give to all their friends; six degrees of separation later, we have reached saturation.

6. Content must be popular enough to be widely shared.

  * If a group of consumers cannot locate a content source or redistributor, the distribution chain breaks.  Poorly seeded torrents are the classic example.

7. People must not rely solely on the original online content source for access.

  * If no one has downloaded or mirrored a copy before access to the original content source is blocked, this approach fails.  Note that most people will have little incentive to save copies of content for offline use from a reliable site, unless Internet access itself is spotty, or the content bundles for download add value.

In some jurisdictions, it may be dangerous to possess certain content, including that of the Wikimedia projects. This issue is outside the scope of this proposal.

Related topics: https://www.mediawiki.org/wiki/Wikimedia_Apps/Offline_support, http://www.kiwix.org/, http://xowa.org/ and so on

Brian Wolff Censorship, Offline Editing, Synchronization Advancing the Contributor Experience

Wikimedia should diversify its distribution methods.

Currently Wikimedia distributes its content almost exclusively using the Internet. However, the Internet is controlled by gate keepers in the form of governments and ISPs. While historically these entities rarely controlled the flow of information, more recently we have seen an increase in censorship, particularly by governments. Since Wikimedia is distributed almost entirely over the Internet, we are vulnerable to their whims.

The risk of having our distribution lines interfered with, is an existential threat to our mission. While at present time, only a few geographic locations practise such interference, the future is unknowable and does not appear to be heading in a comforting direction.

Furthermore, in the face of such interference, there is very little we can do. TOR is often spoken as a solution to censorship, but any such on-Internet system will either have to be obscure or rely on secret information (e.g. TOR bridges) to avoid blocking, and thus cannot be used by the public at large. The most effective solution to censorship so far seems to be political pressure, combined with bundling to make censorship decisions as broad as possible. When much content is bundled together, such as entire domains with TLS, or Github and New York Times[1], it can reduce censorship if there is political will to censor a specific part, but not the whole thing. However, political opinion is fickle, and cannot be relied upon.

Thus, we should reduce this risk by diversifying how we distribute our content. Multiple distribution routes means no single point of failure. I see two ways of doing this:

First, by expanding offline versions of Wikimedia. Kiwix already provides an offline version of Wikimedia sites. We need to expand this capability to allow for better updating. Offline apps should be able to efficiently update their contents in accordance to a scenario where users only have intermittent access to the open Internet. More importantly, offline apps should be able to update in a P2P fashion with other apps. In a community with limited access to open Internet, a single person with an up to date version of Wikipedia, should be able to easily synchronise his/her app with other people's apps to spread the knowledge. This could be especially helpful in a scenario where a small number of people have access via methods such as TOR, but such methods are too burdensome for most people.

Second, we could experiment with broadcasting recent edits widely. To broadcast html versions of all main namespace pages recently edited on English Wikipedia, would only require about 12 KBps [2]. This is not a huge amount of bandwidth. During the Cold War it was common to broadcast propaganda using short wave radio, which could be listened to across the world. Perhaps we could broadcast everything that is edited across the world in a similar fashion, allowing users to stay up to date regardless of their connectivity. This could be combined with the P2P app, so a few power users could listen in to the RC stream, and then spread the data among their communities.

[1] https://en.wikipedia.org/wiki/Censorship_of_GitHub#DDoS_attack [2] Based on very rough experiment, ?action=render of a wikipedia page roughly gzips to the size of the raw wikitext. From there the 12 KBps number is based on the enwiki result of: SELECT sum(l)/(1024*3600*24) FROM

(select max(rc_new_len) 'l' from recentchanges
 WHERE  rc_namespace = 0 and rc_timestamp
 BETWEEN '20170926000000' AND '20170927000000'
 AND rc_type <= 1 group by rc_cur_id
) t;