How much do we leave of ourselves wherever we go? More and more.
The New Yorker has a fascinating piece (subscription needed) on forensic linguistics – the art/craft/science of identifying people – or trying to – by their distinctive style of writing. It’s a still-controversial field that nonetheless says much about the “digital exhaust” we’re increasingly leaving behind, and the possibilities for understanding and analyzing who we are and what we do.
The piece begins with Robert Leonard, head of the linguistics program at Hofstra University, comparing graffiti with emails written by an accused murderer – and describes how his testimony helped convict the man. His analysis was based on the similarity of the styles of writing – and, equally importantly, on the relative rareness of such styles compared to a database of equivalent messages.
In other words, the data showed how probable it was that he'd done it.
Or at least that’s what the jury believed. Whether this is science, or craft, or something else, is still open to debate. But what’s clear is that statistical analysis of how we use words will only increase as we write more and more, in forms that are largely digitized, creating huge datasets of human communications.
The field is bound to thrive on the ever-growing piles of what (retired Georgetown University professor Roger) Shuy calls “data.” Our embrace of personal media – e-mails, text messages, voice mail, tweets – has created an avalanche of tossed-off language, an evidentiary trail that linguists are getting better and better at following.
There’s already the Communicated Threat Assessment Database, a repository of more than a million words of “criminally oriented communications” that allows for analysis of linguistic patterns.
Of course, this kind of language analysis isn’t new. The Unabomber was caught after his massive manifesto was published, prompting people who recognized his style of writing to tip off the authorities. And in 1993, a Philippine Supreme Court Justice resigned four days after a story ran citing the similarities in style between his decision and that of one of the parties to the case in front of him.
But the difference now is one of scale, and technology. We have access to so many more examples of writing – not just because we can digitize what we’ve always written, but because we write so much more, in texts, tweets, blog posts, emails, and so on – and so much more computing power that we can uncover patterns we never knew existed.
The New Yorker piece cites Carol Chaski, the executive director of the Institute for Linguistic Evidence and the president of Alias Technology, as someone who’s been working on an algorithmic model for identifying patterns in language. As her company site notes:
Each ALIAS module accesses the document database, analyzes the document extracting and counting specific linguistic patterns, implements a statistical analysis of the pattern counts, and reports an answer.
In sum, ALIAS works because it uses sophisticated and normally unconscious linguistic features which are very difficult for us to recognize, manipulate and control.
Put another way, her algorithm should find patterns that we don’t see – but nonetheless exist. And could put a person in prison.
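To make the idea of counting “unconscious linguistic features” a little more concrete, here is a toy sketch in Python. It is emphatically not ALIAS or Chaski’s method – just an illustration of the general stylometric approach: tally how often a writer uses common function words (features most of us don’t consciously control) and compare the resulting profiles. The word list, function names, and sample texts are all invented for illustration.

```python
import math
import re
from collections import Counter

# A tiny, illustrative set of function words; real systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "was", "but"]

def feature_vector(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two frequency profiles: 1.0 means identical proportions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical "known" and "questioned" documents.
known = "The note was left in the kitchen, and it was clear that he had written it."
questioned = "It was the same hand that had written the letter found in the drawer."
score = cosine_similarity(feature_vector(known), feature_vector(questioned))
```

Two texts by the same writer should, over enough words, produce profiles that score closer together than texts by different writers – which is the statistical intuition behind the “relative rareness” argument above, even if production systems rely on far subtler features than raw word counts.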
That may seem like a scary concept – and certainly the notion that we’re increasingly trusting algorithms to reach conclusions too complex for us to understand isn’t always a comforting one, especially when the stakes can be as high as a jury verdict. But it’s likely to become an ever more common fact of life, as we provide more and more data for people to analyze.