Posted by: structureofnews | September 20, 2010

The Sex Appeal of Plumbing

OK, there isn’t any.

And there probably isn’t any sex appeal in structured journalism, either – it’s very nuts-and-bolts, data-structure-related stuff that should be invisible to readers. But, like plumbing, it should be an essential part of the infrastructure.

So it won’t have the wow impact of great visualizations or the gee-whiz factor that a well-designed website exudes.

It’s much more basic – what are the fields we want to capture in any given type of story; how do we categorize stories and elements in a story; how do we extract key information and store it; and so on. But if we get it right – as Matt Waite did with Politifact – what we have is the underlying structure that lets us build stronger and smarter applications on top of it.
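As a rough illustration of what "the fields we want to capture" might look like, here is a minimal sketch of a hypothetical fact-check story type. The class name, the field names (speaker, statement, ruling, checked_on, topics) and the sample records are all assumptions made up for this example; they are not Politifact's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class FactCheck:
    """One structured record per story (hypothetical fields, for illustration only)."""
    speaker: str                      # who made the claim
    statement: str                    # the claim being checked
    ruling: str                       # e.g. "True" or "False"
    checked_on: date                  # when the check was published
    topics: List[str] = field(default_factory=list)  # how the story is categorized


def rulings_by_speaker(checks: List[FactCheck]) -> Dict[str, Dict[str, int]]:
    """Count rulings per speaker: a small 'application' built on the structure."""
    summary: Dict[str, Dict[str, int]] = {}
    for check in checks:
        counts = summary.setdefault(check.speaker, {})
        counts[check.ruling] = counts.get(check.ruling, 0) + 1
    return summary


# Made-up sample records.
checks = [
    FactCheck("Candidate A", "Claim one", "True", date(2010, 9, 1), ["economy"]),
    FactCheck("Candidate A", "Claim two", "False", date(2010, 9, 8)),
    FactCheck("Candidate B", "Claim three", "True", date(2010, 9, 15)),
]

summary = rulings_by_speaker(checks)
```

Once the fields are captured consistently, a per-speaker tally like this is a few lines of code rather than a re-reading of every story.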

That’s provided we design the plumbing well. That means making choices early on, and finding ways to amend them when they aren’t working out. That’s one reason why holding out for a technology-oriented solution to parsing free text is so seductive; we don’t have to make any decisions now, since the technology will figure it out for us eventually. But we do have to make those decisions, and we should.

For plumbing to work, enough people have to know how to use it. That means forcing a large enough newsroom – or a focused-enough small newsroom – to pick one structure and stick to it, at least until they see it’s not working well. And in an ideal world, other allied newsrooms would also use the same structure, so the two sets of databases can talk to each other.

So let’s say we were building a relationship database, and decided to capture people’s names, affiliations and connections in a particular format. And say the Washington Post wanted to do the same thing, via their WhoRunsGov site. If the SCMP has a great database on Chinese movers and shakers, and the WP has one of the US government, the smart thing is for both of us to share our databases and really increase the value of the shared resource.

But to do that, we’d have to agree on the underlying data structure (not to mention a whole bunch of other things). If we could agree, that would unlock a lot of potential value. We wouldn’t even have to be wedded for life; it’s probably possible for both sides to track their contributions to the dataset and pull them out if the relationship broke up. (It would be ugly, and nasty, and much of the database would get wrecked. But that’s like many breakups.) And there may be arguments about whether each side is pulling its weight in contributing information to the database. A bit like fights over whose turn it is to take the garbage out.
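Tracking contributions so that either side can pull them out could be sketched roughly as follows. This is a minimal sketch, assuming every record carries a contributed_by field naming the newsroom that supplied it; the Person and Connection classes, their fields and the sample records are hypothetical and not taken from either newsroom's real database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Person:
    name: str
    affiliation: str
    contributed_by: str   # newsroom that supplied this record (assumed field)


@dataclass(frozen=True)
class Connection:
    person_a: str
    person_b: str
    kind: str
    contributed_by: str   # newsroom that supplied this link (assumed field)


def withdraw(records: list, newsroom: str) -> list:
    """Return the records left after one newsroom pulls out its contributions."""
    return [r for r in records if r.contributed_by != newsroom]


def dangling(connections: List[Connection], people: List[Person]) -> List[Connection]:
    """Connections that point at a person who is no longer in the dataset."""
    known = {p.name for p in people}
    return [c for c in connections
            if c.person_a not in known or c.person_b not in known]


# Made-up shared dataset: each side contributed one person,
# and the SCMP contributed the link between them.
people = [
    Person("Person A", "Example Ministry", "SCMP"),
    Person("Person B", "Example Agency", "WP"),
]
connections = [Connection("Person A", "Person B", "met with", "SCMP")]

# The breakup: the WP takes its records back.
remaining_people = withdraw(people, "WP")
remaining_connections = withdraw(connections, "WP")
broken = dangling(remaining_connections, remaining_people)
```

After the withdrawal, the SCMP's connection still exists but points at a person who is gone, which is one concrete way a breakup wrecks part of the database.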

But even without a marriage, shared data structures would make it easier for both sides to talk and share information, even on an ad-hoc basis.

Everyone will want their own structure, of course.  But compromise – or the imposition of an agreed standard by a consortium/group – could help goose value and cooperation.

That, like shared railway gauges, might determine who allies with whom and what they get out of it.


  1. There’s some evidence of collaboration already. One purpose of the evolving W3C linked-data formats (RDF triplestores, etc.) is to allow collaborative construction of taxonomies of all types, because it allows different people to publish taxonomies for specialized areas (politicians, place names, animal species…) that can be used seamlessly together. There are in fact evolving RDF-based standards of the type you suggest, e.g. OWL, SKOS (see

    There are also a few start-ups in this space with custom solutions, such as which is essentially a triplestore wiki. I get the feeling they know what they are doing technically, but we’ll see if they get traction.

    In short, everyone has the problem of computer representation of reality, so it’s going to be solved with or without the news industry’s involvement. The question is, how fast will newsrooms understand that it’s valuable to them, and who will be the first to exploit it fully?

  2. I am not sure that an explicit “marriage” of news gathering organizations needs to exist to create value. A news organization can start with “baby steps” flagging in each article easily identifiable “data” – name, place, date. There’s little ambiguity about these. Plus, there are conventions for microformats already well established which would make these visible to search engines and allow the creation of value from the applications built on top of the data. As long as a news organization conforms to existing microformat standards, they don’t really have to negotiate over the marriage. They just need to contribute to the global volume of tagged data.

    Since not all reporters will take the time to format the information, a CMS could have a layer or function that prompts the reporter: Is this a name? Does this refer to this person? No? Please create an entry for the person… Then a simple pop-up entry form: Name, Dates, Wikipedia reference, short description of the person’s role….

    Anyway, I wonder if there is a Drupal app for this?

    • It’s true, microformats could be a good start. We’re not even doing that. Although, to be fair, part of the problem is that no one yet sees a neat product/visualization out of it, so there’s only a theoretical return on that investment of time.

      So one goal could be to take some existing set of information in microformats and turn it into something that sells – to readers and newsrooms.

      Although I do think that in the long run we need to define more types of data and relationships and agree on a more-common standard.

      Having a CMS that pulled that data would be even better – anyone want to volunteer to talk to CCI?

  3. Jonathan, thanks. I’ll explore the links – I do think, as you say, that the key question for journalists is when and how fast they get with the program and try to leverage their current advantages (daily reporting, beat coverage, etc) into this new world.

  4. Yes, getting it right early on is important. Let’s say I have a database of news agencies / press bodies in China; I might start off with an existing schema. In this case, I had something to get it off the ground, which is the schema used by the GAPP, the Chinese government organization regulating press and publication. In this way, I got it “right”, and I will certainly make it better by extending it with my own columns/attributes, linked objects and whatnot.

    Then, we might also want to have a database of people. Like your problem of making your DB able to talk to the one at WashPost, I’d like my DB to be able to talk with yours. Somehow, we need to find a way to stow them together (“faire arrimer ensemble”, which is how I always think of it). They should be compatible, although I think the priority is to get those DBs off the ground. I personally have the bad tendency to worry too much about what’s ahead. So as a piece of advice for myself (and maybe for others who see themselves in what I say): it is best to have a good, not perfect, schema to organize the information, one that will be easy to refactor but, most importantly, fillable within the current reality of our busy daily lives. Maybe my “news agency” object has a “supervising organization” stored as a name string now. In the future, there’s no reason why this organization couldn’t be some kind of object… As long as we can re-parse our data and do some manipulation, addition and subtraction, I think we’re safe. Standards? That’d be nice.

    Then there’s the other approach of making everything an atom belonging to one generic kind. I’ve seen it in other data(base) designs, where everything in a music website, from album review articles to an “artist”, is stored in one single table. What they “are” and where they “belong” is defined by another table/kind called a “tag”. A really interesting approach too.

    • Cedric, I think there will be a bit of both – I agree that waiting for the perfect structure will leave us waiting. And so just going ahead with a relatively simple but flexible structure is one way of moving forward.

      Another method is to build something that works, and let people come to it. Politifact is one example; it started in St. Pete, covering national politicians, but now it’s been “franchised” to a number of news organizations in other states. I haven’t checked to see how or if all of those independent Politifacts can talk to each other, or what they could come up with if they did, but that’s a thought.

      Somewhere in between is the notion that we can’t capture all relevant information in any story, and shouldn’t try. Ultimately information capture – except in science-fiction worlds where technology does it all for us – is a resource decision: If I spend time recording someone’s age, I don’t have time to get his place of birth. Or vice versa.

      So part of this is thinking about what linkages are useful in some kind of product – whether something immediately for a project, or something with a much longer shelf-life. Then we build the data structure for that.

      That’s partly why I think we’ll come to a point of quasi-marriages of projects and products, rather than a broad schema that works for everyone and everything. Although we’ll also likely go some ways in that direction.

      It’s all good. I hope.

  5. Thought I’d point everyone to a neat picture which illustrates just how underway the process of collaborative taxonomy building already is: check out the map on

    The linked data effort is about publishing data, but this isn’t possible without ontologies (taxonomies). Of course an ontology is just a database as well.

    It should surprise no one to learn that the database version of Wikipedia, DBPedia, provides the central ontology for very many things. In fact the BBC use this in their Wildlife Finder site to provide the ontology of animal species. The great thing about that, they told me, is that if Wikipedia turns out to be missing a species or scientifically inaccurate, the editors just fix it!

    • Looks interesting. One key question we have to ask is how to get journalists to fill out these databases during the course of ordinary work – with or without help from technology.

  6. […] News organizations should think about cooperating not just on stories but on taxonomies […]

  7. […] makes lots of sense. It’s not the sexiest thing to talk about, and grand theories will only go so far without a working CMS to power them, but this […]
