Requirements for creating a Web of Data
I argue here that some of the most important requirements for enabling a Web of Data are: Globality, Open World Assumption, Distribution, Autonomy, Authority, Addressability, Universal Identifiers, Dereferenceability, Interpretability, Backwards Compatibility.
Subscribing to the Web of Data view, the Web is THE global database, and your database is merely a subset contained within it (Globality). As a result, not finding an answer in your subset of the global database does not mean that an answer does not exist (Open World Assumption). Everybody can describe (e.g. add attributes to) things from any database on the Web and put those descriptions in yet another database or databases (Distribution). However, you do not necessarily have the ability to change data in other people’s databases (Autonomy). Your descriptions of other people’s data may stay in your own database, or in any other database you are allowed to update. Therefore each party is responsible only for the data they create, not for what other people say about their data (Authority).
So, in order to allow Autonomy and Distribution, people should be able to refer to Tables, Rows, Columns and Cells in any database within the context of the Web (Addressability). Since these references are made with regard to a global database, they should be universally unique (Universal Identifiers). People need not only to make references to data, but also to retrieve the data referenced by an identifier in order to, perhaps, mash it up with data from other subsets (Dereferenceability).
After retrieving the data, people should be able to interpret it correctly so that they can make use of the information conveyed (Interpretability). First, they need to understand the structure: am I looking at a Table, Row, Column or Cell from the source database (Structural Interpretability)? Further, they may want to put these new data together with data from somewhere else, so it would be useful if we could easily tell that items in two different databases refer to the same real-world object (Vocabulary Homogeneity). We also want all of this new stuff to sit peacefully on top of (or alongside) the current Web of Documents, without breaking it (Backwards Compatibility).
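To make the Open World Assumption and Distribution a bit more concrete, here is a minimal sketch in Python. The two sets, the example.org identifiers and the lookup helper are all made up for illustration; the only point is that a lookup against my own subset answers "unknown here" rather than "no".

```python
from typing import Optional, Set, Tuple

Statement = Tuple[str, str, str]  # (thing, attribute, value)

# Two autonomously maintained databases (hypothetical data and identifiers).
my_db: Set[Statement] = {("http://example.org/alice", "name", "Alice")}
other_db: Set[Statement] = {("http://example.org/alice", "city", "Berlin")}


def lookup(db: Set[Statement], thing: str, attribute: str) -> Optional[str]:
    """Return the value stated in this subset, or None meaning 'unknown here'."""
    for subject, attr, value in db:
        if subject == thing and attr == attribute:
            return value
    # Open world: a missing statement is not a negative answer,
    # it only means this subset of the global database does not say.
    return None


# My subset alone cannot answer the question...
local_answer = lookup(my_db, "http://example.org/alice", "city")

# ...but someone else described the same thing in their own database.
wider_answer = lookup(my_db | other_db, "http://example.org/alice", "city")

print(local_answer, wider_answer)
```

Note that neither party had to touch the other's database: the second description simply lives somewhere else and refers to the same identifier.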
Sure, not an easy job. Much discussion will still happen, but we are making progress.
How does the Linked Data stack attempt to realize the Web of Data?
Linked Data largely relies on the infrastructure that already enables the Web: Web languages, HTTP, DNS, URIs, IP etc. The Universal Identifier used is the URI. One example URI is http://me.pablomendes.com. By using the DNS hierarchy, URIs enable Addressability and Authority: in order to find the owner of http://me.pablomendes.com, you have to ask the owner of com, who will refer you to the owner of pablomendes.com, who will in turn refer you to the owner of me.pablomendes.com. More about how address resolution happens here. Therefore I can refer to any resource on the Web via its URI, at the same time finding out who is responsible for it and gaining the possibility to request more data about it. By using URIs that are also URLs (Uniform Resource Locators), that is, URIs that locate some content on the Web, we make them dereferenceable as well. In other words, the “owner” of a URI should return a description of the object when an HTTP request is made to that URI (Dereferenceability). So, you should get some data if you point your browser to http://me.pablomendes.com.
But in what format is that data going to come? Plain text, HTML, JSON, XML? In order to increase Interpretability, it would be nice to use a common format to represent descriptions of objects on the Web. The W3C recommends the use of RDF. Through the use of RDF, everybody is able to read the descriptions in terms of statements composed of Subject, Predicate, Object (or Entity-Attribute-Value). One example of such a statement is (me.pablomendes.com, name, "Pablo Mendes"). So now you know that this statement is talking about something identified by me.pablomendes.com, stating that the value for the attribute name is “Pablo Mendes”. But unless we speak the same language, you won’t necessarily know what name means. That is why the recommendation is to use widely known vocabularies, so that we increase the chances of others understanding our descriptions. These vocabularies should also be described as Linked Data in order to make them part of the global database as well. One could use, for example, the attribute foaf:name from the “Friend of a Friend” vocabulary, so that all other applications using the same vocabulary can easily reuse this information.
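The two steps described here, asking a URI for a description and reading that description as statements, can be roughly sketched in Python. This is an illustration under assumptions, not how any particular server behaves: the request is only built, never sent; asking for Turtle via the Accept header assumes the server offers an RDF representation; and the helper names are mine. The FOAF namespace is the published one, and the output line follows the N-Triples syntax with simplified escaping.

```python
from urllib.request import Request

# Namespace of the "Friend of a Friend" vocabulary.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(subject: str, predicate: str, literal: str) -> str:
    """Serialize one statement with a plain string value as an N-Triples line.

    Escaping is simplified: only backslashes and double quotes are handled.
    """
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


def describe_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET asking for an RDF description."""
    # Assumption: the server can answer with Turtle when asked for it.
    return Request(uri, headers={"Accept": "text/turtle"})


subject = "http://me.pablomendes.com"

# The statement from the text, with 'name' replaced by a shared vocabulary term.
statement = (subject, expand("foaf:name"), "Pablo Mendes")
print(to_ntriples(*statement))

request = describe_request(subject)
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Because the attribute is now a URI itself, anyone who does not know what it means can, in principle, dereference it in the same way as the subject.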
Since databases are global, but autonomously maintained, it is clear that heterogeneities will arise. Although the use of RDF increases representational consistency, the old problems of Data Integration are not solved. The need for Schema Mapping, Duplicate Detection and Data Fusion becomes obvious in this context. But Linked Data brings an interesting perspective to the integration task, one that relieves data publishers from the burden of single-handedly integrating every other database with their own. Since the database is global and distributed, the burden of data integration can be split across parties. Data can be incrementally integrated by the data publishers themselves, by third-party publishers interested in connecting datasets, or even by Web users when interacting with your data. For more on this idea, take a look at Chris Bizer’s talk on “Pay-as-you-go Data Integration,” as well as his upcoming book with Tom Heath.
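As a rough illustration of integration being split across parties, the sketch below groups descriptions as identity links arrive from different contributors. Everything in it is hypothetical: the URIs, the attribute values and the helper names are invented, the links play the role that statements such as owl:sameAs often play in Linked Data practice, and the "fusion" step is deliberately naive (later values simply overwrite earlier ones on conflict).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Descriptions published independently (hypothetical URIs and values).
descriptions: Dict[str, Dict[str, str]] = {
    "http://a.example/pablo": {"name": "Pablo Mendes"},
    "http://b.example/person/42": {"interest": "Linked Data"},
    "http://c.example/pmendes": {"homepage": "http://me.pablomendes.com"},
}

# Identity links contributed incrementally by different parties.
same_as_links: List[Tuple[str, str]] = [
    ("http://a.example/pablo", "http://b.example/person/42"),    # by publisher A
    ("http://b.example/person/42", "http://c.example/pmendes"),  # by a third party
]


def find(parent: Dict[str, str], uri: str) -> str:
    """Return the representative URI of the group that uri belongs to."""
    while parent[uri] != uri:
        parent[uri] = parent[parent[uri]]  # path halving keeps chains short
        uri = parent[uri]
    return uri


def fuse(descs: Dict[str, Dict[str, str]],
         links: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Group descriptions connected by identity links and merge each group."""
    parent = {uri: uri for uri in descs}
    for left, right in links:
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        parent[find(parent, left)] = find(parent, right)
    merged: Dict[str, Dict[str, str]] = defaultdict(dict)
    for uri, attributes in descs.items():
        merged[find(parent, uri)].update(attributes)  # naive fusion
    return list(merged.values())


after_first_link = fuse(descriptions, same_as_links[:1])
after_both_links = fuse(descriptions, same_as_links)
print(len(after_first_link), len(after_both_links))
```

Each additional link, whoever contributes it, makes the picture a little more integrated without any single publisher having to map every other database up front.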
Although the Linked Data stack may not be the only possible solution to enable a Web of Data, it is the only one that, to the extent of my knowledge, is tackling the issue of “globalizing reuse” by focusing on the requirements I outlined here. Any other solution that falls short of these requirements also falls short of really evolving into a Web of Data.
DISCLAIMER: Linked Data is within my research interests, so it would be fair to expect some bias. :)