There’s been much written about the National Security Agency’s scooping up of massive amounts of communications metadata, representing perspectives from pretty much every legal/political/moral angle – so I’m reasonably sure there’s not a lot I’m adding on those fronts.
But the data-gathering program, first reported by the Guardian, also highlights two interesting issues about “big data” worth pointing out: In many ways, metadata is more valuable than content; and more importantly, the advent of cheap storage and massive computing power will fundamentally change the way we approach – or can approach – data.
The first point perhaps isn’t all that surprising – but it’s a good opportunity to revisit just how much you can learn simply from getting information about information. At a very simple level, as the Electronic Frontier Foundation notes, knowing that you called the suicide hotline from the Golden Gate Bridge can be very telling, even without knowing what you talked about. (Or your credit card company suddenly seeing you rack up huge charges in Nigeria, when you’ve never so much as used your card outside the state before.)
But it’s in the aggregating and analyzing of tons of information that metadata’s value is clearest. Sociology professor Kieran Healy’s smart, funny – and scary – analysis of Paul Revere’s social network (written as if it were done at the time) shows how easy it might be to find terrorists – or revolutionaries. (See the network analysis chart above that puts Revere in the center of everything.)
Rest assured that we only collected metadata on these people, and no actual conversations were recorded or meetings transcribed. All I know is whether someone was a member of an organization or not. Surely this is but a small encroachment on the freedom of the Crown’s subjects.
All I know is this bit of metadata, based on membership in some organizations. And yet my analytical engine, on the basis of absolutely the most elementary of operations in Social Networke Analysis, seems to have picked him out of our 254 names as being of unusual interest.
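The “most elementary of operations” Healy alludes to really is elementary. Here’s a minimal sketch of that kind of co-membership analysis – the membership list below is a toy, loosely modeled on the organizations in his post, not his actual data:

```python
# Toy membership "metadata": who belongs to which organization.
# (Illustrative data only -- a drastically simplified version of the
# 1770s Boston membership lists Healy analyzes.)
from itertools import combinations
from collections import defaultdict

memberships = {
    "StAndrewsLodge": {"Revere", "Warren"},
    "NorthCaucus":    {"Revere", "Adams", "Church"},
    "LongRoomClub":   {"Revere", "Hancock"},
}

# Link two people whenever they share an organization. No conversations,
# no content -- just co-membership.
links = defaultdict(set)
for members in memberships.values():
    for a, b in combinations(members, 2):
        links[a].add(b)
        links[b].add(a)

# Degree centrality: whoever is connected to the most people stands out.
ranked = sorted(links, key=lambda person: len(links[person]), reverse=True)
for person in ranked:
    print(person, len(links[person]))
```

Even on this cartoon dataset, Revere falls out of the sort immediately – he’s the only name linked to everyone else. That’s the whole unsettling point: the analysis requires nothing more than membership rolls and a counting loop.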
Of course, such analysis only really works if we know what questions to ask of the data, and while it may be clear in hindsight what we should have been looking for – for example, people wanting to learn how to fly planes, but not to land or take off – it’s often harder to figure out in real time.
And that’s often made harder by the way data is structured and captured. If we choose to track the price and color of the widgets we sell, but not the time orders are received, we’ll never be able to tell whether there’s a spike in sales right after an ad airs. At least, not without going back and revising our data schema.
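The widget example is easy to make concrete. In the sketch below (all field names and figures invented), the original schema simply has nowhere to put the question:

```python
# Orders as originally captured: the schema records price and color only.
orders_v1 = [
    {"price": 9.99, "color": "red"},
    {"price": 4.50, "color": "blue"},
]

# The question we'd like to ask later -- "did sales spike after the 8pm
# ad?" -- is unanswerable: no record carries a timestamp to filter on.

# Only a revised schema, capturing order time up front, can answer it.
orders_v2 = [
    {"price": 9.99, "color": "red",  "received_at": "2013-06-10T20:02:00"},
    {"price": 4.50, "color": "blue", "received_at": "2013-06-10T19:40:00"},
]

ad_aired_at = "2013-06-10T20:00:00"
after_ad = [o for o in orders_v2 if o["received_at"] >= ad_aired_at]
print(len(after_ad))  # orders received after the ad
```

The point isn’t the code; it’s that the decision about which fields exist was made before anyone knew the question – which is exactly the constraint the next section says is loosening.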
But that’s changing.
In a smart post from a year ago, entrepreneur Alistair Croll notes that the ability to store vast amounts of data, and then throw massive computing power at it, means you don’t have to make those kinds of data-structure decisions – and by extension, those kinds of analysis decisions – up front. You can make them later – but only if you have the data.
For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.
In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.
You decide what data is about the moment you define its schema.
But that moment can come much later now, he notes:
With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, Big Data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected—sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
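A schema-less query is less exotic than it sounds. Here’s a sketch of the collect-first pattern: raw log lines are stored untouched, and a schema is imposed only at query time. The log format and field names are illustrative, not any particular product’s:

```python
# Raw lines kept exactly as collected -- no schema decided up front.
raw_log = [
    "2013-06-07 09:14:02 user=alice action=login src=10.0.0.5",
    "2013-06-07 09:14:09 user=bob action=login src=10.0.0.9",
    "2013-06-07 09:15:30 user=alice action=purchase src=10.0.0.5",
]

def parse(line):
    # The "schema" lives here, in the query, not in the stored data.
    date, time, *pairs = line.split()
    record = dict(p.split("=", 1) for p in pairs)
    record["timestamp"] = f"{date} {time}"
    return record

# A question nobody anticipated at collection time: who made purchases?
buyers = {r["user"] for r in map(parse, raw_log) if r["action"] == "purchase"}
print(buyers)  # {'alice'}
```

Because parsing happens at read time, tomorrow’s unanticipated question just means writing a different `parse` – the collected data never had to be redesigned.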
And that’s, in effect, what the NSA is doing. It turns not just data analysis on its head but, as Stewart Baker, former general counsel of the NSA, points out, the policing process as well.
In the standard law enforcement model that we’re all familiar with, privacy is protected because the government doesn’t get access to the information until it presents evidence to the court sufficient to identify the suspects. In the alternative model, the government gets possession of the data but is prohibited by the court and the minimization rules from searching it until it has enough evidence to identify terror suspects based on their patterns of behavior.
That’s a real difference.
It’s as if the police had 24/7 video footage of every street corner in the city, but pledged never to look at it unless a crime was committed, or they had reasonable suspicion that a crime was going to be committed. (Although, to make the analogy with PRISM more accurate, it’s as if they had video footage of whatever could be seen through your open window.) Ultimately much of the argument about that program hinges on how much you trust the government; but that’s a different debate.
Back to the collect-first, analyze-later issue: To be sure, I have some doubts about current technologies’ ability to easily create structure out of tons of unstructured data – but capabilities are improving all the time. So I don’t doubt we’ll at least go a long way down that road.
So where does that leave us? In a world where data will increasingly get used for new and original purposes never envisioned when it was first collected. Sometimes that’s great – we get better services, or journalists get to uncover more wrongdoing. Sometimes it’s indifferent. And sometimes it’s a problem – if smart analysis of data can help a company better serve us, it can also help it better discriminate against us.
As Alistair Croll noted in a separate post, anyone wanting to figure out the racial composition of neighborhoods in London need only consult this nifty visualization by James Cheshire, a lecturer at the UCL Centre for Advanced Spatial Analysis, which maps the most common surname in each area. When anyone can do this – essentially, with the data in a phone book – it’s much harder to prevent discrimination based on race.
But of course there are also lots of benefits to being able to quickly join and analyze datasets, not least for data journalists who have managed to uncover great stories by marrying disparate bits of information. Ultimately it’s going to come down to a question of how much any given society prizes privacy over other considerations.
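The kind of dataset linkage both of those points rest on can be as simple as a dictionary lookup. A toy illustration – every name and figure below is invented – joining a phone-book-style surname list against a separate per-area dataset:

```python
# Two disparate "public" datasets (all values invented for illustration).
surnames_by_area = [
    {"area": "Wembley", "surname": "Patel"},
    {"area": "Soho",    "surname": "Chen"},
]
area_stats = {
    "Wembley": {"median_rent": 1200},
    "Soho":    {"median_rent": 2100},
}

# The join itself is one line -- which is the point: once both datasets
# exist, linking them is trivial, for journalists and discriminators alike.
joined = [{**row, **area_stats[row["area"]]} for row in surnames_by_area]
print(joined[0])  # Patel + Wembley's stats in a single record
```

Neither dataset is sensitive on its own; the combined record is where both the journalism and the discrimination risk live.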
Meanwhile, we might, as Alistair Croll notes, rethink our notions of what Big Data is – and focus more on data flexibility than size.
Sometimes big data isn’t about the bigness; it’s about the ease with which disparate data sets can be mined and linked. Nobody “owns” this information, but it can be put to all kinds of uses that people might not like.