E-Disclosure, Needles and Haystacks 3 – Keywords | eDisclosure Information Project

This is the third article which looks at issues raised by Alex Charlton and Matthew Lavy in an article in the April / May 2007 edition of the SCL’s Computers & Law Magazine.

The opening article is here.

This section expands on the subject of keywords as a means of cutting down document populations. Keyword culls are a blunt instrument and prone to error, but may be the only cost-effective way to get the arguments heard. The CPR’s overriding objective may override the strict requirements of Part 31 – but you need to get your facts and your arguments in order.

In the second article, I looked at the source volumes available to the claimant – 8 million documents which were reduced by the use of keywords. 333 keywords were first applied. When that “failed to reduce the number of documents to a level that the claimant deemed to be appropriate for disclosure” (the emphasis is mine) a narrower range of 133 keywords was used. This brought the population down to 226,000 documents. It was subsequently reduced to 115,000 documents by the application of a new and agreed set of keywords.
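To put those reductions in proportion, the short sketch below works out each stage as a share of the 8 million source documents. The figures are the ones reported above; the stage labels and variable names are mine, chosen for illustration only.

```python
# Document counts as reported in the article under discussion.
stages = [
    ("source population", 8_000_000),
    ("after the 133-keyword list", 226_000),
    ("after the agreed keyword list", 115_000),
]

start = stages[0][1]

# Share of the original population surviving at each stage.
shares = {label: count / start for label, count in stages}

for label, count in stages:
    print(f"{label}: {count:,} documents ({shares[label]:.2%} of the source)")
```

On these figures the 133 keywords left just under 3% of the source population, and the agreed list left under 1.5%. Neither percentage says anything about how many relevant documents fell within the part that was discarded.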

I made various observations about this – as to the apparently unilateral decisions about the choice of keywords, as to the crudity of keywords as a means of hacking down the population, and as to the fallacious suggestion (as I saw it) implicit in the italicised passage above, that there is some quantity of documents which is objectively “appropriate”.

The use of keywords as a means of reducing volumes is the subject of much argument, statistical study, and ever more sophisticated technology. There is a mass of material about this on the web and I see no point in adding to it. I will point you instead to some existing material on the subject.

Let us start, though, by saying that there is no substitute for a document-by-document manual review by a skilled lawyer who knows what to look for. I was giving a talk earlier in the summer to a group of lawyers about cost-effective ways of getting documents into an electronic system for review by using a variety of relatively inexpensive tools. Someone from the audience waved a ring-binder at me and said that he could be half-way through his review by the time I had got the documents into a system (I think, in fact, that he was doing just that whilst I talked).

That of course is true – for a ring-binder or two. You could get in a couple more skilled lawyers for ten ring-binders. Could you afford enough lawyers (to say nothing of the copying costs) for 1 million documents? 10,000 documents? 5,000 documents? Have you got room to house them, a big enough budget, and time enough in hand before Disclosure must be given? Will they be as accurate as I am sure my challenger was, and consistently so as between each other?

My context in this article is a starting point of 8 million documents. Most UK litigation does not have anything like that volume. You do not need to get into the millions (or even the tens of thousands) to see the need to cut the volumes by some means which is consistent, accurate and proof against attack from an opponent or the court. How do you maintain the consistency and accuracy which you would expect from your skilled lawyers if you could afford them?

Have a look at a paper called STIR: Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval which is publicly available here. This was written by Robert Bauer and Teresa Jade of H5 and by Mitchell Marcus of the University of Pennsylvania. I will not deviate from my subject to expand on the H5 approach to the management of information, but strongly recommend its web site to anyone interested in high-level analysis of large-scale document collections.

Even at only four pages, the paper is not light reading. A document which has “asymptotic” and “reify” in the first three paragraphs is not likely to appeal instantly to a busy lawyer who just wants to get disclosure done with, and the formulae and graph on page 2 showing “Interpolated P-R curves for individual topics” are initially baffling.

It is interesting nevertheless, even for those with relatively small numbers of documents and even – perhaps especially – for those who scorn computerised Discovery and prefer to wade through paper. P is precision (the proportion of the documents retrieved which are actually relevant) and R is recall (the proportion of all the relevant documents for any one topic which the search actually gets back). The paper suggests that 50% P and 50% R is a better result than most Discovery exercises achieve. This, as they point out, means “a best case scenario where ½ the retrieved documents are irrelevant and only ½ the relevant documents are located”.
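For anyone who finds the arithmetic easier to follow than the formulae, here is a minimal sketch of how P and R are calculated for a single search. It is not taken from the H5 paper: the function name, the document identifiers and the counts are all invented, and the numbers are chosen only to reproduce the 50% / 50% scenario quoted above.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one retrieval exercise."""
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 100 documents are truly relevant to a topic.
relevant = {f"doc{i}" for i in range(100)}
# Hypothetical search result: 100 documents come back, only 50 of them relevant.
retrieved = {f"doc{i}" for i in range(50, 150)}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.0%} recall={r:.0%}")  # prints: precision=50% recall=50%
```

Half of what came back is dross, and half of what mattered was never found. The two figures are independent of each other, which is why both have to be quoted.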

I will leave you to read the article. My purpose in referring to it was to suggest that just throwing a list of keywords at a large document population is likely to produce poor P and R values – it is likely to omit many relevant documents (to say nothing of key passages within documents) whilst leaving you with a lot of dross. There, says my friend with the ring-binder, this technology business is going to miss half the documents. The context, however, both of the case I am discussing and the H5 paper, is more documents than the most assiduous lawyer is going to wade through manually, at least if he is to do it cost-effectively.
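The point about a bare keyword list can be illustrated with a toy example. What follows is a sketch only: the five documents, the reviewer's relevance calls and the two keywords are all invented, and real tools do a great deal more than whole-word matching.

```python
import re

# Hypothetical mini-collection: document text plus a human reviewer's relevance call.
DOCUMENTS = {
    "d1": ("The server migration overran the agreed delivery date.", True),
    "d2": ("Slippage on the go-live is now six weeks.", True),
    "d3": ("Lunch delivery for the team is booked for Friday.", False),
    "d4": ("Please find the car park rota attached.", False),
    "d5": ("Delay in acceptance testing was flagged to the client.", True),
}

KEYWORDS = ["delay", "delivery"]  # hypothetical keyword list


def keyword_cull(documents, keywords):
    """Return the ids of documents containing any keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    return {doc_id for doc_id, (text, _) in documents.items() if pattern.search(text)}


retained = keyword_cull(DOCUMENTS, KEYWORDS)
relevant = {doc_id for doc_id, (_, is_relevant) in DOCUMENTS.items() if is_relevant}

missed = relevant - retained  # relevant documents the cull threw away
dross = retained - relevant   # irrelevant documents the cull kept

print("retained:", sorted(retained))
print("missed:", sorted(missed))
print("dross:", sorted(dross))
```

The document about “slippage” is relevant but contains neither keyword, so it is lost; the lunch order contains “delivery” and survives. That is the mechanism behind poor R and poor P respectively, and it operates at 8 million documents just as it does at five.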

I refer you next to an article which appeared in Digital Discovery and e-Evidence in November 2005. It is probably not coincidental that I found it on the H5 website and that H5 were the vendor which carried out the exercise reported in the article. It is interesting for its report on a retrieval exercise where human reviewers were pitted against technological solutions for the identification of key documents. The theme of the article is the reduction of volumes by the use of technology, the cost-savings which result, and the benefits beyond actual discovery. These include one of my hobby-horses, put thus: “Getting more relevant information early in the process puts attorneys in a much better position to determine case strategy and gives them a much stronger basis from which to negotiate with the opposing side”. The formal discovery obligations, in other words, are not the only reason why we sort and analyse documents.

My last reference for you is the web site of either Attenex or Recommind, both large companies whose business is automated extraction of what matters from large document collections. They are not the only such vendors, and I hold no personal brief for either of them, but they are important players in this market. Of more direct importance in this context, they have well-ordered pages of useful material both on the problems and on their solutions.

We left the Needles and Haystacks case at the point where, after three sets of keyword filters had been applied, the document population had been reduced to 115,000 (probably, in fact, by the use of Attenex or something like it). The recipients were still not happy with it, and set a team to manual review. The whole story could be used as an argument against electronic means of giving Discovery / Disclosure.

The authors do not draw this conclusion. I will look separately at what they offer as lessons from their experience. What interests me (and I assume you, if you have stuck with me so far) is the whole business of finding the needles in the haystack, of getting the largest number of relevant documents from the smallest complete sub-set of the original document population, and doing so economically.

The aim is the same whether the population is 8 million or 5,000, particularly if you assume that the budget is more or less proportionate to the volumes. The resources worth applying to that population will vary – multi-million pound cases can turn on finding the right documents from relatively small collections, and low-value cases may involve vast populations (and may have to be abandoned for precisely that reason).

The danger in referring to learned articles and to the web sites of sophisticated searching tools is that one tends to overlook the routine cases. The danger in too close attention to the US experience is that discovery there has become a battleground of its own. The danger in references to the “smoking gun” or the “killer document” is that relatively few cases involve anything so exciting. The danger of paying too close attention to the formal disclosure obligations in Part 31 of the CPR is that the overriding objective gets lost in the minutiae of arguments over keywords.

I am not suggesting that one ignores the bits of the CPR which do not suit the client’s case or his pocket. But if the volumes mean that justice cannot realistically be done by close adherence to the strict obligations then, in a relevant case, it is worth considering – in good time – whether one’s opponent might be persuaded to agree to a narrower scope for disclosure.

Similarly, for all that they are a blunt instrument in searching terms, one can agree a list of the keywords which will hack down the population and give a reasonably high chance that anything relevant will be disclosed whilst minimising the pool which is analysed. This sounds a bit rough and ready, and may offend the purists, but it may be the only way that the merits of a dispute can be adjudicated – and a rough and ready justice is better than none at all, which is what happens if a party decides that it cannot afford to have its claim or its defence aired in court.

The important words here are “in a relevant case”, “considering”, “in good time” and “agree”. If you pole up to a Case Management Conference with:

an informed case for ignoring whole chunks of otherwise potentially disclosable documents, or for a big pre-emptive cull using a list of keywords and
a well-argued reference to the overriding objective and
evidence that you have tried to persuade your opponent to agree

…then the court is likely to listen. Whether you win the point or not depends on whether you can show that your methods are very likely to leave in play the documents on which the case is likely to turn. The key is being informed about the documents and the likely impact – in costs terms as well as evidential terms – of a keyword cull.

In summary, the bare use of technology isn’t the only way to cut the costs of Disclosure. The informed use of computerised tools and techniques might be coupled with a sensible use of the Rules to make Disclosure more manageable, more cost-effective, and more in line with the overriding objective, albeit at the cost of some brutally broad hacks through the document jungle. The secret lies in early preparation and being as well informed about the documents as you are about the issues (actually, the secret lies in not letting the potential population reach 8 million, but that takes us beyond our immediate subject).

I have no idea whether this would have been a practicable route in the case under discussion or, indeed, how much of this was done. These articles are not intended to criticise the conduct of a case of which I know no more than is described in the original article. There are, in any event, no universal answers. Many cases demand a very high degree of precision and recall from enormous populations and the use of applications like Attenex or Recommind, or the services of H5, is the most practical – and sometimes the only – way to tackle them.

Most cases, frankly, do not. They need an intelligent appraisal of the sources before anyone touches a keyboard or a ring-binder, a degree of co-operation between parties, an informed presentation to the court in default of agreement – and perhaps some brutal weeding by keywords or a similarly broad-brush approach. But that weeding must be either consensual or judicially imposed, not unilateral.

If you would like help with any of the matters discussed in this article, please do not hesitate to contact me.

Contact

Chris Dale
eDisclosure Information Project

Tel: +44 (0)1865 463033
Mobile: +44 (0)7770 580640
E-Mail: chrisdaleoxford@gmail.com