DBpedia Spotlight’s First Participation in the Google Summer of Code
We have all watched with excitement as Google unfolded the Google Knowledge Graph, giving insight into answers for questions that we never thought to ask. Similar “knowledge graph” initiatives from researchers in academia and industry have been underway to develop a global graph of Linked Data, where structured data on the Web is directly available for programmatic access in standard ways.
One of the most prominent Linked Data sources is DBpedia, a data set built by sharing (as structured data) facts extracted from Wikipedia. DBpedia has been serving as a nucleus for this evolving Web of Linked Data, connecting cross-domain information from numerous data sources on the Web, including Freebase.com and, by transitivity, the Google Knowledge Graph.
DBpedia Spotlight is a tool for connecting this new Web of structured information to the good old Web of documents. It takes plain text (or HTML) as input, and looks for 3.8M things of 360 different types, interconnecting structured data in 111 different languages in DBpedia. The output is a set of links where ambiguous phrases such as “Washington” are automatically “disambiguated” to their unambiguous identifiers (URIs) Washington, D.C. or George Washington, for example.
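To make the idea of disambiguation concrete, here is a minimal sketch in Python. It is not DBpedia Spotlight's code or API: the context words attached to each candidate are made up for illustration, and the scoring is a plain bag-of-words cosine similarity chosen as a simple stand-in. Only the two DBpedia URIs are real identifiers.

```python
import math
from collections import Counter

# Hypothetical context words for each candidate meaning of "Washington".
# The URIs are real DBpedia identifiers; the word lists are invented.
CANDIDATES = {
    "http://dbpedia.org/resource/Washington,_D.C.":
        "capital city united states congress white house district",
    "http://dbpedia.org/resource/George_Washington":
        "first president general army revolution mount vernon",
}


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, candidates=CANDIDATES):
    """Return the candidate URI whose context words best match the input."""
    ctx = vectorize(context)
    return max(candidates, key=lambda uri: cosine(ctx, vectorize(candidates[uri])))
```

Under these assumptions, a sentence about a general leading an army during the revolution would link "Washington" to George Washington, while a sentence mentioning Congress and the capital city would link it to Washington, D.C.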
During GSoC 2012, we had the pleasure and honor of working with four students to improve DBpedia Spotlight's speed and accuracy and to add new functionality. The core model we use for automatic disambiguation is based on a large vector space model of words, and one student project (by Chris Hokamp) included processing all the data on Hadoop, as well as analyzing the dimensions of this model using techniques such as Latent Semantic Analysis and Explicit Semantic Analysis. A second project (by Joachim Daiber) implemented a probabilistic interpretation of the disambiguation model, and provided a key-value store implementation that allows for efficiency and flexibility in modifying the scoring techniques. Our third project (by Dirk Weissenborn) included topical classification in our model and live updating/training of the models as Wikipedia changes (or news items are released), so that DBpedia Spotlight can be kept up to date with the world as soon as events happen. Finally, the fourth project (by Liu Zhengzhong, a.k.a. Hector) provided an implementation of collective disambiguation. In this approach, each of the things found in the input text contributes to finding the meaning of the other things in the same text, through graph algorithms that benefit from the structure of our knowledge base.
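The intuition behind collective disambiguation can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the GSoC implementation: the relatedness scores are invented, the entity names are shortened, and a single round of voting replaces the graph algorithms mentioned above.

```python
# Hypothetical relatedness scores between entities (invented numbers).
RELATED = {
    frozenset({"George_Washington", "Continental_Army"}): 1.0,
    frozenset({"Washington,_D.C.", "United_States_Congress"}): 1.0,
    frozenset({"George_Washington", "United_States_Congress"}): 0.2,
}


def relatedness(a, b):
    """Look up the (symmetric) relatedness of two entities, 0.0 if unknown."""
    return RELATED.get(frozenset({a, b}), 0.0)


def collective_disambiguate(mentions):
    """Pick, for each mention, the candidate best supported by the others.

    mentions maps a surface form to its list of candidate entities.
    A candidate's support is the sum, over every other mention, of its
    best relatedness to any candidate of that mention.
    """
    result = {}
    for surface, candidates in mentions.items():
        def support(candidate):
            return sum(
                max((relatedness(candidate, o) for o in others), default=0.0)
                for other_surface, others in mentions.items()
                if other_surface != surface
            )
        result[surface] = max(candidates, key=support)
    return result
```

With these made-up scores, a text mentioning both "Washington" and "Continental Army" would resolve "Washington" to George Washington, whereas a text mentioning "Washington" and "Congress" would resolve it to Washington, D.C. Each mention helps decide the meaning of the others.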
Together, these four projects will greatly enhance DBpedia Spotlight towards achieving its objective of serving as a flexible tool that can cater to many different applications interested in connecting documents to structured data. By the way, through links between DBpedia and Freebase you can use DBpedia Spotlight to obtain and use links from Web documents to the Google Knowledge Graph. How exciting is that?
By Pablo Mendes and Max Jakob, DBpedia Spotlight co-creators and GSoC 2012 Organization Administrators.
Featured on the Google Open Source Blog.