The relevance of a computer called ‘Watson’ and a television game show to electronic disclosure

A computer with a homely name like ‘Watson’ and a US quiz show may sound like a trivialisation of the serious subject of electronic discovery / eDisclosure. Equally, a reference to ‘Probabilistic Latent Semantic Indexing’ sounds way over the top for a non-technical audience. But what if we ally the speed of a computer with the sophistication of software algorithms to mimic human thought-processes? New skills are needed.

Let me make it clear right from the start that I do not understand the deeper technology behind Probabilistic LSI and that I nearly overlooked the many articles about IBM’s computer ‘Watson’. I got the message that ‘Watson’ had beaten the star contenders in a US television quiz game called ‘Jeopardy!’, and gathered also that many commentators at the geek and nerd end of the electronic discovery world were excited about it. What I missed was the experiment’s potential for explaining in lay terms what one might expect from the higher end of eDiscovery / eDisclosure applications. It was only when I caught sight of the name Recommind in one of the articles that I thought I had better read further.

Recommind is one of the sponsors of the eDisclosure Information Project, and I am familiar with the user interface which puts a friendly face on what are evidently extremely sophisticated functions. Recommind is not the only provider of intuitive front-ends to complex algorithms, and I pick on it mainly because it was the most familiar name in the first article I read about ‘Watson’. My purpose, however, is to use the Jeopardy! example to illustrate the searching power of some of the tools available to lawyers faced with very large volumes of data. Most lawyers are familiar with keywords, because they use them every day in Google, and treat Google as a simple keyword-matching tool. Google is in fact very much more sophisticated than that, but most of its users neither know nor care, as long as they get an answer to their question in the first few hits.

eDiscovery obligations, however, require more than gathering the first few hits or even the first few thousand hits. They also require more than simple word matching, yet many lawyers reject (that is, do not even look at) such tools because of perceived reliance on a “black box”. The ‘Watson’ and Jeopardy! example gives us a good explanation in lay terms which may help break down these fears. (There are other fears, to do with a consequential loss of lawyer roles and jobs, which I will come to in my next article.)

Let us start with the article which caught my attention. It is called Understanding ‘Watson’ and the Age of Analytics by Nick Brestoff, and it links to another article called Why Watson matters to lawyers. The first part of the Brestoff article describes the Jeopardy! setup (but see below for more on this) and the second half links this to the lawyer’s task and to the tools provided by Recommind and others. The connection lies in the need to find patterns in unstructured data, that is, data which lies in a wide range of individual documents – e-mails, Word files, spreadsheets and so on – as opposed to (or, rather, in addition to) that which sits in neat rows and columns in a structured database.

There are links from this article to explanations of Probabilistic LSI and you can read about Recommind’s technology here. Between them, these pages give you the idea that applications of this kind are doing very much more than looking for the words entered as queries. As Nick Brestoff puts it (quoting an IBM expert)

“Watson does not take an approach of trying to curate the underlying data or build databases or structured resources, especially in manual fashion, but rather, it relies on unstructured data — documents, things like encyclopedias, web pages (note that Watson is not connected to the Internet when it’s playing), dictionaries, unstructured content … Similarly, when we get a question as input, we don’t try and map it into some known ontology or taxonomy, or into a structured query language. Rather, we use natural language processing techniques to try and understand what the question is looking for … “.

Like its rivals (and I often quote good illustrations which come my way), Recommind gives as a simple example that its platform “is able to automatically understand that words can have multiple meanings (e.g. “Java” in reference to the island and “Java” in reference to the programming language) and that there are multiple ways to express the same concept or query”. So far so good – these applications can understand what I mean and not just what I say.
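As a toy illustration of how context can separate the two meanings of ‘Java’, here is a minimal sketch of my own – it is not Recommind’s actual technology, and the context word lists are invented for the example. The idea is simply to compare a document’s words against a profile of words associated with each sense, and pick the closer one:

```python
from collections import Counter
from math import sqrt

# Toy context profiles for two senses of "Java".
# The word lists are invented for this illustration.
SENSES = {
    "island": Counter(["indonesia", "jakarta", "island", "coffee", "volcano"]),
    "language": Counter(["programming", "compiler", "jvm", "class", "code"]),
}

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(document):
    """Pick the sense whose context profile best matches the document's words."""
    words = Counter(document.lower().split())
    return max(SENSES, key=lambda sense: cosine(words, SENSES[sense]))

print(disambiguate("the compiler turns java code into jvm bytecode"))       # language
print(disambiguate("jakarta is the largest city near the island of java"))  # island
```

Real systems learn these associations statistically from the data itself rather than from hand-written lists, but the principle – meaning inferred from surrounding words – is the same.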

For the next stage, we need an explanation of the game Jeopardy! and ‘Watson’s’ role in it, and I have no compunction in turning to Wikipedia for this. All we need from it is an explanation of its questions; we see that Jeopardy! contestants are given the answers and that the winner is the first to give a reasonably (not necessarily precisely) framed question. The example given is “The Father of Our Country; he didn’t really chop down a cherry tree”. ‘Watson’ had 200 million pages of structured and unstructured data to draw on and the humans had their general knowledge. The speed of retrieval is more obviously a benefit in Jeopardy! than it is in electronic discovery / eDisclosure – you would sacrifice a few seconds to be sure of the right answer in litigation, whereas parts of a second may lose you a round in Jeopardy!

That leads into another point about search, similar to the one made about Google above. A Jeopardy! contestant can take risks and might reckon that the first half of the query – “the Father of Our Country” – is enough to say “Who was George Washington?” without waiting for the bit about Washington’s childhood experience as a lumberjack. This is not good enough for a discovery query if there is, in fact, more available information in the query to be processed. You can see the effect with Google’s new “Instant” search, which constantly refines the query and the answer set as you type. You can see also from Google that the exact expression “Father of the Country” produces 4.5 times as many hits as “Father of our Country”, which is itself pretty salutary for those who think themselves adequately served by keyword or string matching technology.

American competitors in an American game-show may reasonably feel themselves entitled to plump for Washington because he is the father of their country. To a Roman, however, pater patriae was Augustus, whilst a Turk would think automatically of Mustafa Kemal, known as Ataturk. To further confuse things, not every document which includes “Washington” and “Cherry tree” is directly about George – there is a pub called The Cherry Tree in Washington in the north of England, and Japan gave several cherry trees to the city of Washington in 1912, for example.

Well, says the lawyer, all this proves my point, does it not? If your systems cannot distinguish between an English pub, a Japanese arboreal gift and the first President of the United States what use are they to me? I am not interested in Augustus or Ataturk, I just want George Washington. I will do it my way, and get a team to read the documents.

The obvious (but not the only) problems here are time and cost. Assume that you have 200,000 pages in place of the 200 million available to Watson. There will be duplicates and near-duplicates amongst them, as well as e-mail text which recurs because it was repeated in a thread. Your human reviewers will have to plough through all 200,000 pages, and you can presumably calculate how long that will take, how big a team you need, and what the cost will be. And yes, assuming that their eyes actually light on references to any of the words and ideas referred to above, their expensive educations will allow them to discriminate between pubs and presidents, trees in Washington and trees chopped down by Washington, and between the 30 or so countries whose founder is known as “the Father of Our Country”.
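To make the duplication point concrete, here is a minimal sketch of my own (with invented documents) of the two mechanical steps involved: hashing normalised text to find exact duplicates, and comparing overlapping word “shingles” to flag near-duplicates such as repeated e-mail thread text:

```python
import hashlib
from itertools import combinations

def fingerprint(text):
    """Hash of whitespace-normalised text: equal hashes mean exact duplicates."""
    return hashlib.sha1(" ".join(text.lower().split()).encode()).hexdigest()

def shingles(text, k=3):
    """Overlapping k-word windows, used for near-duplicate comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented documents for the illustration.
docs = {
    "d1": "Please review the attached contract before Friday.",
    "d2": "Please  review the attached contract before Friday.",  # extra space only
    "d3": "Please review the attached contract before Monday instead.",
}

groups = {}
for name, text in docs.items():
    groups.setdefault(fingerprint(text), []).append(name)
print("exact duplicates:", [g for g in groups.values() if len(g) > 1])

for (n1, t1), (n2, t2) in combinations(docs.items(), 2):
    sim = jaccard(shingles(t1), shingles(t2))
    if 0.5 <= sim < 1.0:
        print(f"near-duplicates: {n1}/{n2} (Jaccard {sim:.2f})")
```

The thresholds and the three-word shingle size are arbitrary choices for the example; commercial processing tools make these judgments at far greater scale and subtlety, but the effect – shrinking the pile before any human reads a page – is the same.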

In pure cost terms, you can get an estimate from a software provider to set against your calculation of man-hours. There are two other factors, however, to bear in mind. One is that your human reviewers are most unlikely to pick up all or even most of the references – the Washington example is only one of many things which you want them to look out for and they are, of course, only human, with hangovers, lovers, debts and day-dreams to distract them from their less-than-enthralling task of turning hundreds of pages per day. Your primary assumption, that human review is the gold standard, is flawed. Your parallel assumption, that the software will both miss things which are relevant and return masses which are irrelevant, under-estimates what modern applications are capable of.
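The man-hours side of that comparison is trivial arithmetic. With assumed figures – 50 pages per reviewer-hour, a ten-person team working seven-hour days, and a notional £40 blended hourly rate, all of which are illustrative rather than benchmarks – 200,000 pages works out like this:

```python
# Back-of-the-envelope review estimate. All figures are assumed for
# illustration; they are not benchmarks.
pages = 200_000
pages_per_hour = 50   # assumed reviewer throughput
reviewers = 10
hours_per_day = 7
rate_per_hour = 40    # assumed blended hourly rate (pounds)

total_hours = pages / pages_per_hour                      # 4,000 hours
working_days = total_hours / (reviewers * hours_per_day)  # about 57 days
cost = total_hours * rate_per_hour                        # 160,000 pounds

print(f"{total_hours:,.0f} hours, about {working_days:.0f} working days "
      f"for {reviewers} reviewers, roughly £{cost:,.0f}")
```

Plug in your own rates and throughput; the point is that the human-review number you set against the software quote is rarely small.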

The software is not simply looking for keywords or, rather, it is not only looking for keywords. What it is doing is executing many different algorithms to find words and phrases which are statistically related to the components of the search query and checking them against each other. Ken Jennings, one of the people who competed against ‘Watson’ in the Jeopardy! game, put it this way:

The computer’s techniques for unraveling Jeopardy! clues sounded just like mine. That machine zeroes in on key words in a clue, then combs its memory (in Watson’s case, a 15-terabyte data bank of human knowledge) for clusters of associations with those words. It rigorously checks the top hits against all the contextual information it can muster: the category name; the kind of answer being sought; the time, place, and gender hinted at in the clue; and so on. And when it feels “sure” enough, it decides to buzz. This is all an instant, intuitive process for a human Jeopardy! player, but I felt convinced that under the hood my brain was doing more or less the same thing.

The parallel between electronic discovery and Watson’s involvement in Jeopardy! ends there – e-discovery involves finding as many as possible of the documents meeting the query, not providing a single answer and, as I have said, there will be multiple sets of criteria to process, many of which will return overlapping responses.

This approach has two other advantages over manual review. The multiple cross-checks by several criteria allow more than a simple binary is-it-relevant-or-not? approach – the returned documents can be weighted from the most relevant downwards which gives you a prioritisation for the subsequent manual review. Furthermore, the processing speed allows you to take ranging shots – to experiment with alternative phrasings for the queries to see very quickly how the outputs compare. Sampling, both of those deemed relevant and those which are rejected, allows you to refine your conclusions and recast the query if you are not happy with the result. You cannot do that quickly or easily with a human team.
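The weighting idea can be sketched very simply. What follows is my own illustration using TF-IDF scores and cosine similarity – one classic way of ranking documents against a query, though real review platforms use far richer models. Each invented document gets a score, and the review queue is just the documents sorted by score:

```python
import math
from collections import Counter

def rank(docs, query):
    """Rank documents against a query by TF-IDF cosine score, best first."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    # Document frequency and (smoothed) inverse document frequency per word.
    df = Counter(w for words in tokenised.values() for w in set(words))
    idf = {w: math.log(1 + len(docs) / df[w]) for w in df}

    def vec(words):
        tf = Counter(words)
        return {w: tf[w] * idf.get(w, 0.0) for w in tf}

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qvec = vec(query.lower().split())
    scores = {name: cosine(qvec, vec(words)) for name, words in tokenised.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented mini-corpus for the illustration.
docs = {
    "memo1": "washington approved the merger and the contract terms",
    "memo2": "the cherry tree pub in washington serves lunch daily",
    "memo3": "merger contract signed by washington counsel yesterday",
}
print(rank(docs, "washington merger contract"))  # memo2 (the pub) comes last
```

Because the output is an ordering rather than a yes/no answer, reviewers can start at the top, and sampling from the bottom shows quickly whether the query needs recasting.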

This does not purport to be a technical explanation, as you will conclude from its reliance on Google and Wikipedia for its examples. The aim is two-fold: to provide a better understanding of what these applications do, and to encourage you to approach at least two providers of solutions like this to see a demonstration and to get an idea of the costs.

In UK civil proceedings, the rules require you to undertake only a “reasonable” search, and deciding what is reasonable is an exercise in proportionality. The purpose of these applications is commonly misunderstood – no one is suggesting that you hand over documents without looking at them. Applications of this kind are aimed at reducing the pile which must be reviewed manually by pointing you to the documents most likely to be relevant and, as a corollary, discarding those unlikely to be relevant. That is not merely helpful, but essential in any jurisdiction. Senior Master Whitaker put it this way in Goodale v Ministry of Justice:

At the moment we are just staring into open space as to what the volume of the documents produced by a search is going to be. I suspect that in the long run this crude search will not throw up more than a few hundred thousand documents. If that is the case, then this is a prime candidate for the application of software that providers now have, which can de-duplicate that material and render it down to a more sensible size and search it by computer to produce a manageable corpus for human review – which is of course the most expensive part of the exercise. Indeed, when it comes to review, I am aware of software that will effectively score each document as to its likely relevance and which will enable a prioritisation of categories within the entire document set.

What then of the lawyers? If all this software is so clever, will it take the bread out of the mouths of those who have hitherto earned their livings from e-Discovery? Even as I was writing this article, the New York Times published an article on this very subject. That, and the reactions to it, are the subject of my next post.


About Chris Dale

I have been an English solicitor since 1980. I run the e-Disclosure Information Project which collects and comments on information about electronic disclosure / eDiscovery and related subjects in the UK, the US, AsiaPac and elsewhere.
