
Human-powered data fusion: round trip (in/ex)ternal data reuse in Wikipedia

April 6, 2011

Wikipedia is the world’s biggest source of collaboratively edited knowledge. People produce, curate, and update information that can be as accurate as the Encyclopædia Britannica. These contributions are very valuable, and we should respect contributors’ willingness to help by maximizing the value of each edit they make. Meanwhile, tons of data are already available on the Web from trustworthy sources; see, for example, the Open Government Data movement. Why make Wikipedians copy and paste that information, then cite the source to convince other Wikipedians?

“In future, it may be possible to remove the need for a human to populate some parts of Wikipedia altogether, says Möller. ‘Fundamentally a lot of this data probably shouldn’t be entered by humans in the first place, it should just, say, poll the source of a figure like GDP once a year.’ That’s a capability that Koren has already added to Semantic MediaWiki, through an extension called ExternalData.”

“The External Data extension allows MediaWiki pages to retrieve, filter, and format data from one or more sources. In addition to external URLs, data can also come from a regular wiki page, an uploaded file, an LDAP directory, or a relational database.”

For example, in order to import values from a CSV file generated by an external website (the URL below is a placeholder):

{{#get_web_data:url=http://example.com/countries.csv
|format=csv with header
|data=bordered countries=borders,population=population,area=area,capital=has capital}}

And in order to use the imported values:

* Germany has area {{#external_value:area}}.


With a tool like this, Wikipedians can start using their precious time to do something much better than copy and paste. They can focus on creating data that is not yet available, or they can curate and verify the data that is automatically imported. Suppose that a given government’s open data inadvertently reports outdated information: Wikipedians could mark up the given #external_value, indicating a new (corrected) value and a source to back up their claim.
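To make the curation step concrete, here is a minimal sketch in Python of how such an override could be resolved. The data model (an imported value plus an optional sourced correction) is hypothetical — it is not part of the actual External Data extension — but it captures the fusion rule described above: a human correction with a citation beats the automatic import.

```python
# Hypothetical sketch of resolving an #external_value when editors have
# supplied a manual correction. The dict structure is illustrative only,
# not the actual External Data extension's data model.

def resolve_value(imported, override=None):
    """Prefer a sourced human correction over the automatically
    imported value; fall back to the import otherwise."""
    if override and override.get("value") and override.get("source"):
        return {"value": override["value"],
                "provenance": override["source"],
                "corrected": True}
    return {"value": imported["value"],
            "provenance": imported["source"],
            "corrected": False}

# Example: a government CSV reports an outdated population figure,
# and an editor supplies a corrected value with a citation.
imported = {"value": "81,000,000", "source": "http://example.gov/data.csv"}
override = {"value": "81,751,602", "source": "http://example.org/census-2011"}

print(resolve_value(imported, override)["value"])  # the sourced correction wins
print(resolve_value(imported)["corrected"])        # no override: import is kept
```

The key design point is that the correction only wins when it carries a source, so the provenance of every displayed value stays machine-readable.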

This would provide a round-trip integration from data providers (not only governments) to a global-scale collaborative editing tool (Wikipedia) and back. Data providers need only monitor Wikipedia changes and trigger some internal (probably human-driven) update process.
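The provider side of that round trip could be sketched as follows. The MediaWiki API does expose recent edits via `list=recentchanges`; the watched-page list and the sample payload below are made up for illustration, and a real monitor would fetch the changes over HTTP rather than from a hard-coded list.

```python
# Hedged sketch: scan a batch of MediaWiki recent changes for edits to
# pages known to consume a provider's data, so a human-driven update
# process can be triggered. WATCHED_PAGES and the sample payload are
# illustrative assumptions.

WATCHED_PAGES = {"Germany", "France"}

def changes_to_review(recentchanges):
    """Return the subset of changes that touch watched pages."""
    return [rc for rc in recentchanges if rc["title"] in WATCHED_PAGES]

sample = [
    {"title": "Germany", "user": "ExampleEditor",
     "comment": "corrected area figure"},
    {"title": "Unrelated article", "user": "SomeoneElse",
     "comment": "typo"},
]

for rc in changes_to_review(sample):
    print(f"review: {rc['title']} edited by {rc['user']} ({rc['comment']})")
```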

Consider, for example, DBpedia. The DBpedia project extracts information from Wikipedia and shares it in an entity–attribute–value structured format. One could imagine reusing DBpedia data for information that appears on multiple pages, avoiding redundant maintenance work. As the DBpedia extraction is a best-effort transformation from semi-structured to structured data, errors can be introduced in the process. Wikipedians could then mark up the DBpedia attribute with a corrected value and a source, making it possible to automatically update DBpedia (and therefore all associated Wikipedia pages). I see such a process as the ultimate human-powered Web data fusion machine.
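For readers unfamiliar with how DBpedia data is consumed, here is a small sketch that composes a SPARQL query for DBpedia’s public endpoint (http://dbpedia.org/sparql). The property name used here is an assumption chosen for the example; the actual DBpedia ontology terms for a given attribute may differ.

```python
# Illustrative sketch only: composing a SPARQL SELECT for one attribute
# of a DBpedia resource. The property name is an assumption for the
# example, not a guaranteed DBpedia ontology term.

def dbpedia_query(resource, prop):
    """Compose a SPARQL SELECT for one attribute of a DBpedia resource."""
    return (
        "SELECT ?value WHERE { "
        f"<http://dbpedia.org/resource/{resource}> "
        f"<http://dbpedia.org/ontology/{prop}> ?value . }}"
    )

print(dbpedia_query("Germany", "areaTotal"))
```

A page template could run such a query instead of hard-coding the figure, which is exactly the redundancy the round trip would remove.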

See also:
* Shortipedia is a wiki that allows anyone to enter facts about anything. The authors “hope for Shortipedia to move to the [Wikimedia Foundation] servers and become the common knowledge base for all Wikipedias”, a sort of “Wikimedia Data Commons”.

Disclaimer: I work at the Freie Universität Berlin with the DBpedia team.
