Posted by: structureofnews | July 23, 2012

The Trail We Leave

How much do we leave of ourselves wherever we go?  More and more.

The New Yorker has a fascinating piece (subscription needed) on forensic linguistics – the art/craft/science of identifying people – or trying to – by their distinctive style of writing.   It’s a still-controversial field that nonetheless says much about the “digital exhaust” we’re increasingly leaving behind, and the possibilities for understanding and analyzing who we are and what we do.

The piece begins with Robert Leonard, head of the linguistics program at Hofstra University, comparing graffiti with emails written by an accused murderer – and how his testimony helped convict the man.   His analysis was based on the similarity of the styles of writing – and equally importantly, the relative rareness of such styles compared to a database of equivalent messages.

In other words, the data showed how probable it was whodunit.

Or at least that’s what the jury believed.  Whether this is science, or craft, or something else, is still open to debate.  But what’s clear is that statistical analysis of how we use words will only increase as we write more and more, in forms that are largely digitized, creating huge datasets of human communications.

The field is bound to thrive on the ever-growing piles of what (retired Georgetown University professor Roger) Shuy calls “data.” Our embrace of personal media – e-mails, text messages, voice mail, tweets – has created an avalanche of tossed-off language, an evidentiary trail that linguists are getting better and better at following.

There’s already the Communicated Threat Assessment Database, a repository of more than a million words in “criminally oriented communications” that allows for analysis of linguistic patterns.

Of course, this kind of language analysis isn’t new.  The Unabomber was caught after his massive manifesto was published, prompting people who recognized his style of writing to tip off the authorities.  And in 1993, a Philippine Supreme Court Justice resigned four days after a story ran citing the similarities in style between his decision and that of one of the parties to the case in front of him.

But the difference now is one of scale, and technology.  We have access to so many more examples of writing – not just because we can digitize what we’ve always written, but because we write so much more, in texts, tweets, blog posts, emails, and so on – and so much more computing power that we can uncover patterns we never knew existed.

The New Yorker piece cites Carol Chaski, the executive director of the Institute for Linguistic Evidence and the president of Alias Technology, as someone who’s been working on an algorithmic model for identifying patterns in language.  As her company site notes:

Each ALIAS module accesses the document database, analyzes the document extracting and counting specific linguistic patterns, implements a statistical analysis of the pattern counts, and reports an answer.

In sum, ALIAS works because it uses sophisticated and normally unconscious linguistic features which are very difficult for us to recognize, manipulate and control.
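To make the count-then-compare idea concrete, here is a minimal sketch in Python. It is not how ALIAS works (the description above doesn't say which patterns or statistics it uses); it simply counts a small, hypothetical list of common function words and compares two texts with cosine similarity, one of the simpler techniques from the stylometry literature. The word list, the function names and the sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words.
# Real studies use much larger, carefully chosen feature sets.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but", "on", "upon"]

def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(p, q):
    """Cosine of the angle between two frequency profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    # Made-up sample texts, purely for demonstration.
    questioned = "I went to the store and on the way I saw that the road was closed."
    candidates = {
        "writer_a": "The report that came in on Monday was sent to the board and the staff.",
        "writer_b": "Upon reflection, a man of honor acts in a manner that a court respects.",
    }
    q = profile(questioned)
    for name, sample in candidates.items():
        print(name, round(cosine_similarity(q, profile(sample)), 3))
```

A score like this only says which sample looks more alike on these few counts. It says nothing about how rare the style is in the wider population, which is the part of the analysis that matters most in a courtroom.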

Put another way, her algorithm should find patterns that we don’t see – but that nonetheless exist.  And that could put a person in prison.

That may seem like a scary concept – and certainly the notion that we’re increasingly trusting algorithms to come to conclusions that may be too complex for us to understand isn’t always a comforting one, especially when the stakes can be as high as a jury verdict.  But it’s likely to be increasingly a fact of life, especially as we provide more and more data for people to analyze.


  1. This kind of work, or at least its anticipation, has been carried out in academic circles for some time. Called stylometry, its aim was perhaps most famously pursued by Frederick Mosteller and David Wallace in the 1960s with their analysis of the likely authorship of the Federalist Papers.

    • Abbott, absolutely. The difference now is in the richness of the data available, which means we can apply more statistical rigor to analysis now. Reg

  2. I assume it’s only a matter of time before someone — probably a well-intentioned privacy advocate thinking of dissidents under repressive regimes, perhaps someone with more malicious intent — creates software that is designed to smooth out such telltale indicators of authorship. Then we’ll be off to the races…

  3. Theo, it’s true – it’s an arms race of sorts… although there’s so much stuff/data out there now, and so much more computing power, that it’s hard to imagine that all of this information can be stuffed back into a bottle. Reg
