How useful are keywords in refining document populations? They can be a blunt instrument, but it may be proportionate to use blunt instruments as long as everyone involved is aware of the method used. What does it all mean to the man on the Birmingham omnibus?
It may be reasonable to search some or all of the parties’ electronic storage systems. In some circumstances, it may be reasonable to search for electronic documents by means of keyword searches (agreed as far as possible between the parties) even where a full review of each and every document would be unreasonable. There may be other forms of electronic search that may be appropriate in particular circumstances.
This passage comes from Paragraph 2A.5 of the Practice Direction to Part 31 CPR where it is part of the expanded definition of the scope of a reasonable search. It is all a bit clunky, really, in that this part of the PD was a belated add-on to Rule 31.7 CPR (the duty of search) and actually repeats part of that section. At the least, it is tiresome to have two overlapping sources for the same obligation. At worst, this is one of the reasons for the tacit agreement to ignore the whole subject which has been the norm hitherto.
I am also a bit puzzled by that word “even” – the idea that it may be reasonable to use keyword searches even where a document-by-document review is not possible surely inverts the logic – that is exactly the circumstance in which you do need the help of computer-aided searches.
Be that as it may, this part of the Practice Direction is authority, if such be needed, for the application of computer search technology to the task of cutting down large document populations. One needs more than just authority, of course – a solicitor must satisfy himself, his client, his opponent and the court that the result amounts to a reasonable search, not just for the sake of formal compliance with the rules, but because of the possibility of omitting something vital, or including something for which privilege should be claimed. One of the balancing factors, of course, is that the large bunch of humans who might otherwise be set to the task would be both expensive and, being only human, prone to human error.
All this has been analysed in a recent paper by Bruce Hedin of H5. Called Searching in all the wrong places: the Effectiveness of Search Tools in E-Discovery, it surveys the performance of search tools and other information retrieval methodologies in establishing an adequate level of recall (what proportion of the relevant documents have been identified as such) and precision (what proportion of the documents identified are actually relevant). I use the word “relevant” in its loosest sense – see Relevant is irrelevant to standard disclosure.
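For those who prefer arithmetic to definitions, the two measures can be sketched in a few lines of Python. This is an illustration only: the function name, the document identifiers and the two sets are invented for the purpose, and nothing here is taken from Hedin's paper.

```python
def recall_and_precision(relevant, retrieved):
    """Return (recall, precision) for a single search.

    relevant  -- set of document ids that a perfect review would mark relevant
    retrieved -- set of document ids that the search actually returned
    """
    hits = relevant & retrieved  # relevant documents which the search found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical figures: four relevant documents exist; the search returns
# five documents, of which three are relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc4", "doc8", "doc9"}

recall, precision = recall_and_precision(relevant, retrieved)
print(f"recall={recall:.2f} precision={precision:.2f}")
# prints: recall=0.75 precision=0.60
```

The catch in practice is that nobody knows the full "relevant" set in advance, which is why measuring recall on a real document population generally depends on sampling rather than on a simple count like this one.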
I was introduced to this via a report of a judgment of US Magistrate Judge John Facciola in U.S. v. O’Keefe, No. 06-CR-249 (D.D.C. Feb. 18, 2008). Judge Facciola (whom I have met – one of the clearest judicial thinkers in this area) says:
Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics … Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman and requires that any such conclusion be based on evidence that, for example, meets the criteria of Rule 702 of the Federal Rules of Evidence.
FRE Rule 702, which concerns expert evidence, is presently the subject of deep debate as to the difference between opinion and fact, which I am happy to leave the jurists to discuss. The practical questions for everyday disclosure are of more pressing concern to those confronted by a mass of documents which they cannot sensibly read page by page.
All this plays differently depending on your position. In the US, you can be tripped up and sanctioned for vast sums because your methodology is deemed less than thorough. If, on the other hand, you try telling the Birmingham Mercantile Court that you need very expensive processing tools to handle a £50,000 claim, you will be sharply reminded that proportionality – the balance between input and costs needed to arrive at justice – is the main consideration (they will talk about “proportionality” in the US as well, but they will mean something very different).
The technology varies in sophistication and cost. A straightforward keyword search will find “conspire” (in the unlikely event that anyone has so described his actions), and may also find “conspiracy”, “conspirator” and so on. A concept engine will go much further. Recommind recently showed me a search including “brownout” which brought back “blackout” as well, and such tools will go further and tell you what they think are the dominant words and concepts instead of expecting you to spot them first. I have described elsewhere how DocuMatrix, given a handful of sample documents, will go away and come back with other documents which have the same characteristics, rated on a scale which allows the user to pick a cut-off point below which documents can be safely excluded – “safely” connoting that same computer-assisted but human-made balance of review cost versus the chance of missing something.
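The simplest end of that range, a keyword search which also picks up variants of the word, might look something like the sketch below. It is a toy under stated assumptions: the function name and the sample emails are made up, the matching is a plain word-prefix match using Python's standard re module, and it says nothing about how Recommind, DocuMatrix or any other product works internally.

```python
import re


def keyword_hits(documents, stem):
    """Return the ids of documents containing a word which starts with `stem`.

    documents -- dict mapping a document id to its text (hypothetical data)
    stem      -- the beginning of the keyword, e.g. "conspir"
    """
    # \b anchors the match to the start of a word; \w* allows any ending.
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Invented sample population.
documents = {
    "email-1": "We did not conspire with anyone.",
    "email-2": "He was described as a conspirator.",
    "email-3": "The alleged Conspiracy is denied.",
    "email-4": "Lunch on Friday?",
}

print(keyword_hits(documents, "conspir"))
# prints: ['email-1', 'email-2', 'email-3']
```

A search of this kind remains entirely literal. It will never bring back “blackout” for “brownout”, which is the gap the concept engines set out to fill.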
All the software solutions are based on different algorithms and, in consequence, may produce different results. It is this which alarms the lawyers. If Product A produces a very different result set from Product B, how can they say they have done a thorough search? It is perhaps some consolation to know that the traditional roomful of lawyers is certain to have produced yet a different result, that a different roomful would come up with yet another, and that the same teams would produce different results on Monday morning and on Friday afternoon. At least the same software tool should be consistent on any day of the week.
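If two tools do produce different result sets, the size of the disagreement is at least easy to measure, and the documents found by one tool but not the other are an obvious place to start a manual check. The sketch below is a hypothetical illustration: the function name, the product labels and the document ids are invented, and the overlap figure used (documents found by both, divided by documents found by either) is just one simple way of expressing agreement.

```python
def compare_result_sets(a, b):
    """Summarise how far two sets of retrieved document ids agree."""
    both = a & b      # found by both tools
    union = a | b     # found by at least one tool
    return {
        "both": both,
        "only_a": a - b,   # found by the first tool alone
        "only_b": b - a,   # found by the second tool alone
        "overlap": len(both) / len(union) if union else 1.0,
    }


# Invented results from two hypothetical products run over the same documents.
product_a = {"doc1", "doc2", "doc3", "doc5"}
product_b = {"doc2", "doc3", "doc4"}

report = compare_result_sets(product_a, product_b)
print(sorted(report["only_a"]), sorted(report["only_b"]), round(report["overlap"], 2))
# prints: ['doc1', 'doc5'] ['doc4'] 0.4
```

Running the same comparison on one tool's output from two different days should, on the argument above, show complete overlap, which is more than could be said for two roomfuls of reviewers.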
In the US, they batter each other to death over this sort of thing. One suspects that some of this “my search algorithm is better than your search algorithm” stuff is driven more by tactics, or by the need to justify sunk investment, than by any real difference in the results – if you had just bought a fancy bit of kit, you would be keen to promote its virtues over others.
Sometimes, of course, there really is a difference which matters. A judge may really have to decide between two very different results. If H5’s research points towards using their extremely sophisticated mix of intellect, process and technology, well, there are cases which need it, just as some buildings warrant a top-flight architect.
What does it all mean to the man on the Birmingham omnibus, with a set of documents which is bigger than he (or the judge) will want reviewed in full, and which he can’t afford to read? Bruce Hedin’s article may be aimed at a higher level of understanding than you aspire to but, even read superficially, it throws light on what the issues are. The next step is to pick up the phone to a supplier, describe the problem and get a quotation. Estimate the costs of doing it by any traditional means, weigh up the risks and benefits, and there is the agenda for your next discussion with your opponent.
I simplify deliberately, but this is what the rules (specifically Paragraph 2A.2 of the Practice Direction to Part 31 CPR) require – that is, the discussion with the opponent is compulsory, the rest is necessary groundwork.
Tactically speaking, the mere fact that you show that you are addressing the problem, can recite the PD Paragraph 2A.2 obligations and can show even the lightest knowledge of PD Paragraph 2A.5 and of the issues which Hedin addresses, may send your opponent into a state of nervousness from which settlement discussions may emerge.
But first you have to make the call to a supplier. There is overlap between the respective functions of just the two mentioned above, quite apart from any others. If you don’t know who to ring, contact me.