That adage goes to the heart of what much of journalism has traditionally been about – verification and corroboration (and some transparency as well). What can be verified, and by whom? Do they have a bias? What documents back up the assertion? Are there other possible explanations?
But what happens when it’s an algorithm that comes to a conclusion? One that a newsroom writes, and publishes the results of? What are the standards for verification then?
Let me back up a second. This isn’t about how we do data analysis and come to conclusions; we’ve been interrogating databases for ages, and there are pretty well-established principles for what constitutes a publishable piece of information.
But consider this great interactive that ProPublica put up a while back; it calculates and projects when particular states’ unemployment insurance systems might run out of funds. The calculations that go into it are all disclosed, and it had a great track record. It’s a useful service that takes public data further, through the prism of smart journalists who know their beat.
But that’s a big step away from classic journalistic methods of verification. Sure, you can and should talk to experts in the field to make sure the algorithm works; but this is a different kind of verification. What if, instead of unemployment insurance systems, there was an interactive that predicted bank failures? To be sure, such calculators exist – but they aren’t generally made public via a news organization. What are our responsibilities if we publish that kind of information?
As Big Data increasingly floods newsrooms, we’re turning more and more to sophisticated social science methods to sort, sift, group and analyze that information. In some cases we’ll use what we find as leads to chase down stories; in others we might well be publishing the results of our analysis. What standards of proof or validation do we need to develop, and how will we tell audiences why they should trust us?
The Guardian’s great visualization of how false rumors during the London riots spread, were challenged, and eventually died via tweet and counter-tweet is another example. A key part of the visualization depended on categorizing tweets as for, against or neutral on any given rumor – something a team of academic partners helped them out with.
…we needed to find which tweets belong to each cluster. Again, our academic partners proved invaluable, providing a parametrized Levenshtein distance algorithm for finding all tweets within a certain “distance” from each other in textual terms.
Once the clusters were identified, we developed a system to visualize their rise and fall over time. Sizing each tweet according to the influence of its author (determined by follower count), we added a decay function that would allow it to dissipate over time. As such, clusters grow and shrink as their theme is taken up by additional voices.
Our last challenge was to classify each tweet according to a ‘common sense understanding’ of its main role as a communicative act. Did it support, oppose, query or comment on a rumour? In addition to an algorithmic analysis by our academic partners, each tweet was independently coded by three sociology PhD students in order to enable us to check for reliability.
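The first two steps the Guardian describes – grouping near-identical tweets by edit distance, then sizing each one by its author’s reach with a decay over time – can be sketched roughly as below. The Guardian’s actual parametrized algorithm isn’t public, so the threshold-based greedy grouping, the one-hour half-life, and all parameter values here are illustrative assumptions, not their method.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster_tweets(tweets, max_distance=5):
    """Greedily assign each tweet to the first cluster whose
    representative (its first member) is within max_distance edits.
    The greedy rule and threshold are assumptions for illustration."""
    clusters = []
    for text in tweets:
        for cluster in clusters:
            if levenshtein(text, cluster[0]) <= max_distance:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters

def influence(follower_count, age_seconds, half_life=3600.0):
    """Size a tweet by its author's follower count, decaying
    exponentially over time (assumed one-hour half-life)."""
    return follower_count * 0.5 ** (age_seconds / half_life)
```

Retweets and lightly edited copies land in the same cluster because their text sits within a few edits of the original; a cluster’s visual weight at any moment is then the sum of its members’ decayed influence scores.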
It’s complicated stuff – and full credit to the Guardian for putting it all out there and letting readers decide how much credence to give it. But the broader point is that as an industry we’re asking our audiences to accept increasingly complex methodology for what we do.
When we report that someone has said something, or a document contains some assertion – and assuming we reported accurately – it’s generally clear to readers what’s happened. (Although arguably there’s a need for much greater media literacy in general.) It’s up to them to assign a level of confidence to what we say, and how much they feel they trust the person or document we quote. But it’ll get much harder the more we move into a land of algorithmic number-crunching.
Consider sentiment analysis. If we – or someone else – says that tweets are increasingly going for or against a person, company or position, what we’re really saying is that an algorithm is categorizing the language in tweets one way or another.
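To make that concrete, here is a toy version of what “an algorithm categorizing the language in tweets” can mean at its simplest: matching words against hand-built lists. The word lists and cutoffs are invented for illustration; real sentiment systems use far richer models, which is exactly why their strengths and weaknesses need explaining.

```python
# Hypothetical word lists -- a real lexicon would be far larger
# and weighted; this is only to show the shape of the technique.
POSITIVE = {"great", "love", "support", "win"}
NEGATIVE = {"awful", "hate", "oppose", "fail"}

def classify(tweet: str) -> str:
    """Label a tweet 'for', 'against' or 'neutral' by counting
    matches against the positive and negative word lists."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "for"
    if score < 0:
        return "against"
    return "neutral"
```

Even this crude sketch makes the editorial question visible: sarcasm, negation (“not great”) and context all defeat it, and a reader deserves to know how the real classifier handles those cases before trusting a headline built on its counts.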
Not that such algorithms are a bad thing; on the contrary, they can help us understand the flood of information flowing around us. But we need to be able to understand how they work, and what their strengths and weaknesses are.
And we definitely need to be transparent about what we’re doing. Still, let’s face it – not a lot of people are going to be able to understand how all this works in detail. And if we as an institution want to keep people’s confidence, we’re going to have to do a great job at explaining why they should trust all these new – and powerful – methods.