The Holy Grail of structured journalism is some kind of taxonomy that accurately and completely describes any kind of story – from the 14,000-word eight-part investigative series to the 250-word brief on a car accident.
That’s probably not going to happen – and if such a taxonomy did exist, it would probably make sense to write a program to comb through stories and extract it. One of the obstacles for programmers trying to deconstruct and parse free text is that no such taxonomy currently exists – and one may never exist.
So if we skip the Holy Grail and instead focus on working on reasonably-sized subsets of journalism stories – in other words, move up from standalone projects like Politifact (a great avenue to pursue as well) and down from a general taxonomy – what do we find?
Quite a lot, actually.
Court reporting under the UK legal system is one such subset. It’s a highly templated kind of reporting, because of the way English law governs what can and can’t be said once a trial is underway. So among the fields you could create for this kind of story would be what day of the trial it was (to link with previous stories), who spoke, who examined/questioned, background on the case, the new testimony, etc. That makes it possible, on the 11th day of the trial, to reassemble a story that has the day’s news (new testimony), who spoke, etc, chained to more persistent content – the previous 10 days’ testimony, the background to the case, and so on.
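To make that concrete, here is a minimal sketch in Python of how such a record might be stored and reassembled. Every name in it (TrialDayReport, TrialCase, assemble_story and the individual fields) is an assumption made for illustration, not an existing newsroom schema, and the output format is just one plausible way to chain the day's news to the persistent material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrialDayReport:
    """One day's court report as structured fields (hypothetical field names)."""
    case_id: str
    trial_day: int          # which day of the trial this report covers
    speakers: List[str]     # who spoke
    examiners: List[str]    # who examined/questioned
    new_testimony: str      # the day's news


@dataclass
class TrialCase:
    """Persistent content for a case, plus every daily report filed so far."""
    case_id: str
    background: str
    reports: List[TrialDayReport] = field(default_factory=list)

    def add_report(self, report: TrialDayReport) -> None:
        if report.case_id != self.case_id:
            raise ValueError("report belongs to a different case")
        self.reports.append(report)
        # Keep reports in trial-day order so earlier days chain correctly.
        self.reports.sort(key=lambda r: r.trial_day)

    def assemble_story(self, trial_day: int) -> str:
        """Lead with the day's news, then background, then earlier testimony."""
        today = next((r for r in self.reports if r.trial_day == trial_day), None)
        if today is None:
            raise ValueError(f"no report filed for day {trial_day}")
        parts = [
            f"Day {today.trial_day}: {today.new_testimony}",
            "Spoke: " + ", ".join(today.speakers),
            "Examined by: " + ", ".join(today.examiners),
            "Background: " + self.background,
        ]
        for r in self.reports:
            if r.trial_day < trial_day:
                parts.append(f"Earlier (day {r.trial_day}): {r.new_testimony}")
        return "\n".join(parts)
```

Calling assemble_story(11) on a case with eleven filed reports would return the day-11 testimony first, followed by the background and the ten earlier days in order. A real system would presumably store these records in a database rather than in memory.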
Sports reporting is another clear example, at least in the context of match coverage – everything from the obvious (score, winner, loser) to publication-created context (MVP of the match, weather conditions, home or away, etc). In theory, that lets readers see how often a favorite player is the MVP, at least as judged by the publication, and how their team did at home or away, in various weather conditions, etc. To be sure, a lot of this is already done by sports sites. But a publication with a good sports staff may be able to create new data fields unique to them and hence add a new kind of value to their reports and data.
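As a rough illustration of the queries this would allow, the Python sketch below counts MVP awards and tallies results by any field a publication chooses to record. The MatchReport fields and the two helper functions are hypothetical names invented for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable


@dataclass(frozen=True)
class MatchReport:
    """Structured fields filed alongside a match report (hypothetical names)."""
    team: str
    opponent: str
    home: bool            # True if `team` played at home
    team_score: int
    opponent_score: int
    weather: str          # publication-created context
    mvp: str              # the publication's own pick


def mvp_counts(reports: Iterable[MatchReport]) -> Counter:
    """How often each player was named MVP by the publication."""
    return Counter(r.mvp for r in reports)


def record_by(
    reports: Iterable[MatchReport],
    key: Callable[[MatchReport], Hashable],
) -> Dict[Hashable, Dict[str, int]]:
    """Won/drawn/lost tallies grouped by any field, e.g. home/away or weather."""
    table = defaultdict(lambda: {"won": 0, "drawn": 0, "lost": 0})
    for r in reports:
        if r.team_score > r.opponent_score:
            outcome = "won"
        elif r.team_score < r.opponent_score:
            outcome = "lost"
        else:
            outcome = "drawn"
        table[key(r)][outcome] += 1
    return dict(table)
```

For example, record_by(reports, lambda r: r.home) gives the home and away records, and record_by(reports, lambda r: r.weather) does the same per weather condition. Passing the grouping field as a function means a new publication-specific field needs no new query code.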
That’s part of the goal with an idea I had for the South China Morning Post’s Racing Post site, which is to combine the SCMP’s own added value (racing tips) with other publicly available data on which horse ran, with which jockey, under what conditions, etc.
Disaster reporting is more complex, but there are clear data fields that apply to all: Deaths, missing, financial cost, etc. It’s not that it’s hard to extract that data from a story, or even that it would be a significant efficiency improvement to be able to feed such data straight into a website; it’s that having the data allows for more applications, some of which are useless and some of which may be very valuable. If we have death toll mapped against time, we can see how the numbers rise; if we have death toll vs. financial cost, we can look at correlations. These may or may not be important; but if we have them, we can find out.
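Both of those questions are easy to ask once the fields exist. The Python sketch below assumes a hypothetical DisasterUpdate record filed with each story; the field names, the choice of US dollars for cost, and the helper functions are all assumptions for illustration. The correlation shown is a plain Pearson coefficient, one possible measure among several.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DisasterUpdate:
    """Figures reported in one story about one event (hypothetical names)."""
    event_id: str
    reported_on: date
    deaths: int
    missing: int
    cost_usd: Optional[float] = None  # often unknown in early reports


def death_toll_timeline(
    updates: Iterable[DisasterUpdate], event_id: str
) -> List[Tuple[date, int]]:
    """Death toll against time for one event, oldest report first."""
    rows = sorted(
        (u for u in updates if u.event_id == event_id),
        key=lambda u: u.reported_on,
    )
    return [(u.reported_on, u.deaths) for u in rows]


def latest_per_event(updates: Iterable[DisasterUpdate]) -> Dict[str, DisasterUpdate]:
    """The most recent update for each event, for cross-event comparisons."""
    latest: Dict[str, DisasterUpdate] = {}
    for u in updates:
        current = latest.get(u.event_id)
        if current is None or u.reported_on > current.reported_on:
            latest[u.event_id] = u
    return latest


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

To look at death toll vs. financial cost, one could take latest_per_event(updates), keep the events whose cost_usd is known, and pass the two columns to pearson. Whether the resulting number means anything is a separate editorial question.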
I’m sure restaurant reviews are already taxonomized and entered in databases at many publications; but are we collecting all the data fields we could? Should we be able to sort by chef, type of cuisine, opinion of the reviewer, time of review, and so on? Can we map and plot over time?
Ditto police stories. Are we capturing all the data so we can build our own map/story/aggregated data or figuring on tapping the police database when we need it? If we don’t build our own data structure with our own content – while tapping into the police database as well – we essentially lose the value of what we created. Can we marry our court coverage database with the police database, so that over time we have a sense of how the justice system is doing?
Other ideas: General safety stories (product recalls, etc), car crashes, fires, war deaths/injuries/incidents, book reviews.
The difference between this and classic computer-assisted reporting (CAR) work is that the emphasis here is on creating the data from day-to-day work, with an eye to then tapping that data later on; in CAR, it’s about asking the questions after the fact and then trying to assemble the data to interrogate. Both approaches make sense – but structured journalism is more scalable.