
Scala using GPUs: First Step

Good news: your notebook probably comes with a GPU (e.g. a MacBook Pro or Lenovo ThinkPad). Running some experiments with Scala on your GPU should be easy, but the libraries are evolving fast and the documentation doesn’t always keep up. Because the tutorials I found didn’t work for me right off the bat, I decided to post here the sequence of steps that worked for me (on MacOS).

  • install scala 2.9
    brew install scala29
  • make sure you’re running scala 2.9
    scala -version
  • create a file called build.sbt
    scalaVersion := "2.9.2"
    resolvers += "Sonatype OSS Snapshots Repository" at "https://oss.sonatype.org/content/repositories/snapshots"
    resolvers += "NativeLibs4Java Repository" at "http://nativelibs4java.sourceforge.net/maven/"
    libraryDependencies += "com.nativelibs4java" % "javacl" % "1.0.0-RC2" // force latest JavaCL version
    libraryDependencies += "com.nativelibs4java" % "scalacl" % "0.2"
    autoCompilerPlugins := true
    addCompilerPlugin("com.nativelibs4java" % "scalacl-compiler-plugin" % "0.2")
  • Create a test class
    import scalacl._
    import scala.math._
    object Test {
      def main(args: Array[String]): Unit = {
        implicit val context = Context.best // prefer CPUs? use Context.best(CPU)
        val rng = (100 until 100000).cl // transform the Range into a CLIntRange
        // ops done asynchronously on the GPU (except the synchronous sum):
        val sum = rng.map(_ * 2).zipWithIndex.map(p => p._1 / (p._2 + 1)).sum
        println("sum = " + sum)
      }
    }
  • Run the test class with sbt
    SCALACL_VERBOSE=1 sbt "run-main Test"


When the term Big Data dies, what comes next?

Technology executive John De Goes points out that the term “big data” has been abused to oversell just about any data product, and points to new, more instructive terms that the industry should focus on:

As the industry matures, there won’t be a single term that replaces the big data moniker. Instead, different tools and technologies will carve out different niches, each more narrowly focused and highly specialized than the universal sledgehammer that was big data.

The list includes:

  • Predictive Analytics: employing advanced techniques in statistics, machine learning, pattern recognition, data mining, modeling, natural language processing, and other fields to identify and exploit patterns.
  • Smart Data: heavily relying on predictive analytics to monetize large volumes of data.
  • Data Science: extracting meaning from large amounts of data to create new data products, e.g. by employing predictive analytics.
  • NewSQL: highly scalable distributed SQL systems, a child of the good old RDBMS and the new, rebellious NoSQL.
  • Other trends: streaming analytics, NLP, multimedia mining, in-memory storage and computing grids, and graph databases.


If you want to learn about Big Data or influence its future in Europe and beyond, come join us in the discussion at the Big Data Public Private Forum project.

Science is the difference between instinct and strategy

Daniel Tunkelang delivered the keynote address at CIO Summit 2012. In his talk he argues for science as a strategy. He starts by noting that in many organizations decisions are all too often made by the HiPPO (Highest Paid Person’s Opinion), and proceeds to offer three suggestions.
  1. Ask the right questions
    • In computing: garbage in, garbage out. In analytics: we will not get the right answers if we don’t ask the right questions.
    • Example: LinkedIn switching from measuring accuracy of interviewer to optimizing tension between releasing weak candidates early and not missing the good ones.
    • Is asking the right questions an art or a science?
  2. Practice good data hygiene
    • Given enough data and computational resources, you can find anything.
    • Example: if an NFC team wins the Super Bowl, the stock market goes up (with 80% accuracy).
    • Correlation is not causality.
    • Separate hypothesis generation from hypothesis testing.
  3. Don’t argue when you can experiment
    • Recent research indicates that reasoning may exist not for the purpose of finding the truth, but as a way of persuading people.
    • Example: Amazon offering product recommendations at checkout. The idea was initially knocked down by instinct, but later shown to be positive in experiments; once implemented, recommendations came to drive 5% of Amazon’s business.
    • Why argue when you can test?

The video Science as a Strategy is available online and it is well worth watching.

Issues like the ones mentioned by Daniel are being discussed in the Data Analysis working group of the EU Project BIG. You should join the conversation!

Re-scanning only one page and updating the PDF

Sometimes you are asked to print a document, sign it, and send it back by e-mail. If it is a 25-page document, you don’t want to scan everything again. You can work on only the pages that need your signature.

1) Break the input PDF (unsigned) into pages. Assuming 25 pages in the original doc (large-document.pdf), do:

for i in $(seq -f "%04.0f" 25) ; do pdftk large-document.pdf cat $i output pg_$i.pdf ; done

As a result, you will get one pdf file per page, where the output is named pg_0001.pdf for page 1, …, pg_0025.pdf for page 25.
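If your pdftk build supports it, the built-in burst operation does the same split in a single call. A sketch; the printf-style %04d pattern reproduces the pg_0001.pdf naming used above:

```shell
# Split every page of the input into its own file in one command.
# The %04d pattern yields pg_0001.pdf, pg_0002.pdf, ..., one per page.
pdftk large-document.pdf burst output pg_%04d.pdf
```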

2) Print only the pages that need a signature, sign them, and scan them. For page 25, save the scan as scanned_25.png.

3) To convert the PNG to black and white, use GIMP 2.6: menu “Tools / GEGL Operation / c2g” (color to gray).

I also had to resize the images to 8.5 x 11 inches (US letter size).
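ImageMagick can also do the resize from the command line. A sketch, assuming the page was scanned at 300 dpi, so US letter is 8.5*300 x 11*300 = 2550x3300 pixels:

```shell
# Force the scan to US letter proportions at an assumed 300 dpi.
# The '!' flag ignores the original aspect ratio; drop it to preserve it.
convert scanned_25.png -resize 2550x3300! scanned_25.png
```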

4) To convert a signed page (scanned_25.png) to PDF, use ImageMagick. Use the right file name so that the signed page replaces the unsigned one. Note that the page geometry is given with -page (-size takes pixel dimensions, not inches):

convert scanned_25.png -page Letter pg_0025.pdf

5) To glue it all back together:

pdftk pg_*.pdf cat output pg_all.pdf

On Windows, you can do the same via Cygwin. Just make sure to install all the necessary packages, including pdftk, ImageMagick, etc.

When Big Data Goes Bad

In episode 6 of the RadioFreeHPC podcast, the hosts discuss a couple of examples of when Big Data goes bad.

Here are the conclusions I noted down (often with my own bias) from the podcast and from the articles they mention:

  • Be careful with conclusions: One cannot blindly follow any software. People making decisions based on software-generated advice should have at least a basic understanding of what data goes into the software and what are the assumptions made.
  • Be careful with input: Parameters used in the software must be chosen with good reason.
  • Be careful with output: Unexpected results must trigger further investigation. “Computer says no” [1] is not a valid answer.
  • Care about additional information/feedback: Do not discard any evidence that may contradict the initial findings without careful consideration.
  • Be careful with decision making: Only enable people to execute the kinds of analysis that they are prepared to execute correctly.
  • Care about visibility: make sure everybody that should see the data has seen it.
  • Care to ‘connect the dots’: need an integrated view over different systems/datasets that influence some decision.
  • Care about usability: “effectiveness of any technology is down to the people that use it.” Systems/recommendations must be understandable and easy to use by the people who make decisions, not system architects or statisticians.
  • Be careful with complex systems: as systems get stacked on top of legacy systems, even IT personnel lose track of what exists and where. Care about simplifying the stack.

All of these topics and more are being discussed in the context of the EU FP7 BIG project. If you’re interested in these discussions, there are a few discussion lists that you can join to contribute to the conversation; two of the working groups are particularly relevant for the discussion above.

Big Graph Data Panel at ISWC2012

Although not in a conventional blog-post style, I would like to share my notes from the (may I say, fantastic) Big Graph Data Panel at ISWC2012 in Boston, MA, USA. The panel was moderated by Frank van Harmelen and composed of Michael Stonebraker, Tim Berners-Lee, John Giannandrea and Bryan Thompson.

Disclaimers: Although I tried to be faithful to what I heard at the panel, please do not take the attributed sentences as word-for-word quotes. There is certainly some unmeasured amount of interpretation and paraphrasing that went into this. There is also bias towards topics (or rhetoric) that piqued my interest, and unfortunately some lack of attribution (or misattribution) due to the speed at which I had to type the notes as people were speaking. :) I still hope the content is useful for your understanding of the topics.


What is Big Data? What do Semantic Web and Big Data offer to each other? 

Stonebraker: Big Data encompasses three types of data problems.
– volume: you have too much data. For example, all of the Web.
– speed: data comes at you too fast. For example, query logs, sensor data, etc.
– variety: you have too many sources of data. For example, a pharma company with thousands of spreadsheets, each by individual researchers, with no common language (German, English, Portuguese), different writing styles, vocabulary, etc.

In other words, as Deborah McGuinness put it in her tweet: Big Volume or Big Velocity or Big Variety. These “three Vs of Big Data” were originally posited by Gartner’s Doug Laney in a 2001 research report.

Stonebraker: Big Data is only a problem if your data need grows faster than memory gets cheaper.

Giannandrea: First thing is to understand what a Graph Database even is.

Berners-Lee: everything can be structured in a graph. Saying graph structured data is like saying “data data”.

Do we even need graph databases? Don’t relational database systems already solve everything?

Stonebraker: “the major DB vendors are 30-year-old obsolete systems that are not good at anything”.

Stonebraker: it is an unsolved problem how to do graph computations at scale, meaning at sizes where the data cannot fit in whatever aggregate memory you have.

Paraphrasing Deborah McGuinness’s quoting of Stonebraker: let the benchmark wars begin. If the winners are 10x better, you survive (if they are only 2x better, then the giant companies will take you over).

Would anybody with Big Graph Data problems use SPARQL? Or even SQL? Or must it be MapReduce?

Stonebraker: About SPARQL, don’t get hung up on your query language. In the Hadoop world, everybody is moving to Hive. Hence, all SQL vendors are starting to write Hive-to-SQL translators.

Stonebraker: About MapReduce, it is not the final answer. Google wrote MapReduce 7 years ago. It is good at embarrassingly parallel tasks; joins are not embarrassingly parallel.

What about Open Data? Is there any incentive for it?

Van Harmelen: “standard anecdote: the incentive for opening up your data is that if you get successful, your servers burn down.”

Stonebraker: the biggest problem is trying to put together, after the fact, data that was not designed to be put together.

Stonebraker: deduplicating fuzzy data is one of the killer problems.

Stonebraker: “there is tremendous value in curating the data.”

Sheth: explicit, named relationships make deduplication easier.

What about the original Semantic Web idea, of querying a distributed graph on the Web?

Stonebraker: “query response time in the distributed way is as slow as the slowest provider. People centralize to speed it up.”

Attendee: “centralizing only makes sense if you know what you want to do. Putting data out on the Web enables people to find it.”

Someone (I think Thompson?): “Semantic Web research needs to find out what is the right bit of “well curated data/process/schema/query” to add on top of big data.”

My question to the panel: What will Big Graph Data look like in 2022? Solved or beginning? Volume? In silos or global?

Giannandrea: in 2022 we will understand data better.
Thompson: we will still not have semantic interoperability across systems by 2022.
Stonebraker: big data will be a bigger and bigger problem for at least the next decade.
Berners-Lee: there is a battle of small vocabularies for interoperability coming in the next few years.


Participate! If you would like to continue discussing the future of Big Graph Data, you should join the mailing lists of the FP7 BIG Project. The Working Groups on Data Analysis and Data Storage are particularly relevant to this discussion. Check them out!


DBpedia Spotlight’s First Participation on the Google Summer of Code

We have all watched with excitement as Google unfolded the Google Knowledge Graph, giving insight into answers for questions that we never thought to ask. Similar “knowledge graph” initiatives from researchers in academia and industry have been underway to develop a global graph of Linked Data, where structured data on the Web is directly available for programmatic access in standard ways.
One of the most prominent Linked Data sources is DBpedia, a data set built by sharing (as structured data) facts extracted from Wikipedia. DBpedia has been serving as a nucleus for this evolving Web of Linked Data, connecting cross-domain information from numerous data sources on the Web and, by transitivity, the Google Knowledge Graph.
DBpedia Spotlight is a tool for connecting this new Web of structured information to the good old Web of documents. It takes plain text (or HTML) as input, and looks for 3.8M things of 360 different types, interconnecting structured data in 111 different languages in DBpedia. The output is a set of links where ambiguous phrases such as “Washington” are automatically “disambiguated” to their unambiguous identifiers (URIs) Washington, D.C. or George Washington, for example.
During GSoC 2012, we had the pleasure and honor of working with four students to enhance DBpedia Spotlight in runtime performance, accuracy and extra functionality. The core model we use for automatic disambiguation is based on a large vector space model of words, and one student project (by Chris Hokamp) included processing all the data on Hadoop, as well as analyzing the dimensions of this model using techniques such as Latent Semantic Analysis, Explicit Semantic Analysis, etc. A second project (by Joachim Daiber) implemented a probabilistic interpretation of the disambiguation model, and provided a key-value store implementation that allows for efficiency and flexibility in modifying the scoring techniques. Our third project (by Dirk Weissenborn) added topical classification to our model and live updating/training of the models as Wikipedia changes (or as news items are released), so that DBpedia Spotlight can be kept up to date with the world as soon as events happen. Finally, the fourth project (by Liu Zhengzhong, a.k.a. Hector) provided an implementation of collective disambiguation. In this approach, each of the things found in the input text contributes to finding the meaning of the other things in the same text, through graph algorithms that benefit from the structure of our knowledge base.
Together, these four projects will greatly enhance DBpedia Spotlight towards achieving its objective of serving as a flexible tool that can cater to many different applications interested in connecting documents to structured data. By the way, through links between DBpedia and Freebase, you can use DBpedia Spotlight to obtain and use links from Web documents to the Google Knowledge Graph. How exciting is that?
by Pablo Mendes and Max Jakob, DBpedia Spotlight co-creators and GSoC 2012 Organization Administrators.
Featured on the Google Open Source Blog.
