That’s not surprising. The possibilities inherent in being able to amass and analyze reams of information – from statistical analysis to machine learning to just plain data analysis – are astounding, and people are cottoning on to the fact that there may be big bucks in Big Data. The critical thing about Big Data is that it’s much more than just more data; it’s data on a scale that allows us to do completely different things.
Take Google Translate. At some level, it’s laughably bad – even if it’s free and does a passable job of translation. Read any translation it does and you’ll groan over the strange choices of words or the tortured grammar. Often a 12-year-old could do better. A drunk 12-year-old could do better. But carping about the quality of the translation misses the point; it’s astonishing that a program with no rules of grammar, no knowledge of syntax, and no dictionary of vocabulary can translate at all.
That’s because all it really does is take masses of documents translated into different languages – say press releases from the UN in multiple languages – and apply statistical analysis to them, learning from patterns embedded in those documents and applying them to new documents. This approach doesn’t work if all you have to work with is one document, or even just a couple of hundred thousand pages; but feed in millions and then raw computing power can start to make sense of it. And then the more you correct its translations, the more it learns new patterns and improves. It’s an amazing – and somewhat scary – result. (Google’s feat is a bit like the joke about the man whose dog can play chess – when told it’s an astonishing achievement, he downplays it: “Sure, but he loses 4 times out of 5.”)
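To make that concrete, here’s a toy sketch in Python of the counting idea at the heart of statistical translation. This is not Google’s actual system, and the three-sentence “corpus” is invented; the point is simply that plausible translation pairs can fall out of nothing but co-occurrence counts.

```python
from collections import Counter, defaultdict

# Invented toy corpus of aligned sentence pairs, standing in for the
# millions of parallel documents (UN press releases, say) a real system uses.
parallel_corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the car", "la voiture"),
]

# Count how often each English word co-occurs with each French word,
# and how often each French word appears overall.
cooccurrence = defaultdict(Counter)
french_totals = Counter()
for english, french in parallel_corpus:
    french_words = french.split()
    french_totals.update(french_words)
    for e_word in english.split():
        cooccurrence[e_word].update(french_words)

def translate_word(e_word):
    """Pick the French word most strongly associated with e_word:
    co-occurrence count normalized by the French word's overall frequency."""
    counts = cooccurrence.get(e_word)
    if not counts:
        return e_word  # no evidence at all; pass the word through
    return max(counts, key=lambda f: counts[f] / french_totals[f])

# No grammar, no syntax, no dictionary -- just counting:
print(" ".join(translate_word(w) for w in "the blue house".split()))
# -> "la bleue maison": the right words in tortured order, which is
#    exactly the kind of output worth groaning over.
```

Notice that the toy version gets the vocabulary right and the word order wrong – the same failure mode the paragraph above pokes fun at, and the reason corrections fed back into the system are so valuable.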
And so it is with Big Data in general. It’s at an early stage, and what it produces so far is the equivalent of losing 4 games out of 5. But it’s just getting going, and there’s no shortage of inputs. We’re awash in data – government statistics, consumer behavior, building sensors, and tons of free text – that we’re just beginning to be able to process. What will we be able to glean from all of this one day? And what does it mean for journalism?
Potentially lots – on the story front, on the tools front, and perhaps on the business model front as well.
There are obviously lots of possible stories to be gleaned if we can get our hands on some of that data and sift through it for interesting patterns. As the Economist notes:
Tax authorities are getting better at spotting spongers (for example, by flagging people who claim unemployment pay as well as occupational-injury benefits). Health services are mining clinical data to gauge the cost-effectiveness of drugs. After a detailed study of its clients, the German Federal Labour Agency managed to cut its annual spending by €10 billion ($14 billion) over three years while also reducing the length of time people spent out of work.
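That first example is, at bottom, a data-matching exercise. Here’s a minimal sketch, with invented names, of how simple the core operation can be – real record linkage is messier, keying on IDs and fuzzy name matching, but the idea is the same:

```python
# Two invented benefit rolls; flagging double-claimants is a set intersection.
unemployment_claimants = {"A. Jones", "B. Patel", "C. Murphy"}
injury_claimants = {"B. Patel", "D. Okafor"}

flagged = unemployment_claimants & injury_claimants
print(flagged)  # -> {'B. Patel'}: a case for a human to review, not a verdict
```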
Imagine the possibilities in being able to sift through huge troves of information, seeking out outliers and anomalies, as The Wall Street Journal did in its great series on oddities in Medicare claims.
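What does “seeking out outliers” look like in practice? Here’s a hedged sketch using made-up provider totals. The median-based test is a standard first-pass technique for exactly this kind of sifting, not the Journal’s actual method; it’s robust because a single huge value can’t hide itself by inflating the average.

```python
import statistics

# Hypothetical billing totals per provider, stand-ins for the kind of
# claims data a newsroom might get its hands on.
claims = {
    "Provider A": 48_200,
    "Provider B": 51_900,
    "Provider C": 47_500,
    "Provider D": 243_000,
    "Provider E": 50_100,
}

values = list(claims.values())
median = statistics.median(values)
# Median absolute deviation: robust to the very outliers we're hunting for.
mad = statistics.median(abs(v - median) for v in values)

for provider, total in claims.items():
    # Modified z-score (Iglewicz & Hoaglin); |score| > 3.5 is a common cutoff.
    score = 0.6745 * (total - median) / mad
    if abs(score) > 3.5:
        print(f"{provider}: ${total:,} -- an anomaly worth a reporter's time")
```

An anomaly flagged this way is a lead, not a story; the reporting still has to explain why Provider D’s number is so far out of line.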
And then if Big Data realizes its potential in statistical analysis, it can drive powerful tools that can, for example, extract much more meaning out of the pages of unstructured information being produced each day. Semantic engines like Open Calais (owned by Thomson Reuters, I should add) already do this, but additional work on statistical analysis of documents promises even more. The FT piece, for example, talks about the challenge of trying to understand free text.
This has been complicated further by the big growth in unstructured data – information, such as text, that is not organised in a way that a computer can easily process. With the volume of user-generated text and video growing rapidly, this has become one of the main focuses of technological development.
Chief among the new tools are natural language processing, which enables a computer to extract meaning from text, and machine learning, the feedback loops through which computers can test their conclusions on large amounts of data in order progressively to refine their results.
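Here’s a minimal sketch of the feedback loop the FT describes – a toy word-count classifier with invented labels and stories, nothing like a production NLP system, but it shows the mechanism: guesses get corrected, and corrections become training data.

```python
from collections import Counter

word_counts = {}  # label -> Counter of words seen under that label

def train(text, label):
    word_counts.setdefault(label, Counter()).update(text.lower().split())

def classify(text):
    """Score each label by how many of the text's words it has seen before."""
    words = text.lower().split()
    return max(
        word_counts,
        key=lambda label: sum(word_counts[label][w] for w in words),
    )

train("quarterly earnings beat forecasts", "business")
train("striker scores in injury time", "sport")

story = "record earnings from player transfers"
print(classify(story))  # -> "business" (the only label with matching words)

# An editor corrects the label, and the correction is fed back in:
train(story, "sport")
print(classify("club confirms two new transfers"))  # -> now "sport"
```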
Can we make money out of this? McKinsey thinks so. Its report talks about huge potential productivity gains from better understanding what customers want and tuning offerings to those findings. Recommendation engines, for example – the systems that suggest stories you might want to read – depend largely on feedback loops generated by masses of data to be effective. Whether these will make a difference to journalism enterprises is a different issue. But there are at least some grounds for optimism.
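A recommendation engine of that sort can be sketched in a few lines. This toy version, with invented readers and story slugs, suggests whatever the most similar reader has clicked on – real systems are far more sophisticated, but the feedback loop is the same:

```python
# Readers' click histories (hypothetical data) drive suggestions, and each
# new click feeds back in to sharpen the next round of suggestions.
reading_history = {
    "reader_1": {"budget-deficit", "bond-markets", "tax-reform"},
    "reader_2": {"budget-deficit", "bond-markets", "rate-decision"},
    "reader_3": {"transfer-window", "match-report"},
}

def recommend(reader):
    """Suggest stories read by the most similar other reader."""
    mine = reading_history[reader]
    overlap = {
        other: len(mine & theirs)  # shared stories = crude similarity
        for other, theirs in reading_history.items()
        if other != reader
    }
    nearest = max(overlap, key=overlap.get)
    return reading_history[nearest] - mine  # their stories I haven't read

print(recommend("reader_1"))  # -> {'rate-decision'}

# Each click updates the history, so the loop keeps learning:
reading_history["reader_1"].add("rate-decision")
```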
Not that this is all an unmitigated good. It’s not even clear how good it is; and there are certainly lots of unanswered questions about privacy, and about where the limits of machine learning and statistical analysis might lie.
Still, there’s much potential here. But whether this future is a year ahead, or five years, or fifty, we should keep working on how to help the process of building and analyzing data along. To repeat an old point of mine – even if we knew a strong semantic engine that could accurately and efficiently parse our stories was coming along in five years, why not help the process along by starting to structure our content now? Why not lay the building blocks of a potential new information age? Why cling to processes developed for a different age and a different platform?
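What might “structuring our content now” look like? One hedged illustration – the field names below are invented for the example, not any particular standard:

```python
import json

# A story published with explicit, machine-readable fields rather than
# a blob of free text. A future semantic engine gets clean building
# blocks instead of having to guess names, places, and topics.
story = {
    "headline": "City council approves new budget",
    "dateline": "2011-06-14",
    "people": ["Jane Smith"],
    "organizations": ["City Council"],
    "places": ["Springfield"],
    "topics": ["municipal budget", "local government"],
    "body": "The city council voted 7-2 on Tuesday to approve...",
}

print(json.dumps(story, indent=2))
```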
Big Data may indeed be big. But we can start small.