Equivio spells out predictive coding basics on ESIBytes podcast

Yet again, I find myself pointing to one of Karl Schieneman’s ESIBytes podcasts as a source of timely and coherent explanations of topical eDiscovery issues.

Predictive coding inevitably dominates at the moment, thanks to the coincidence of the Da Silva Moore and Kleen Products litigation. In both cases, although in very different ways, the defensibility of predictive coding (or technology assisted review, or computer assisted review, call it what you will) is at issue. Most of us who are interested in encouraging the use of predictive coding would have preferred a less confused battleground than is offered by either of these cases, and value any explanations which stick to basic propositions uncluttered by the wider agendas coming out of the cases.

Anyone who speaks from first-hand experience, whether as a provider or a user, will have a preferred product; what matters are the core concepts, and it would be odd if speakers did not use their own or their preferred products to illustrate these concepts. Here, as in his other podcasts, Karl taps the special knowledge of his speakers to draw out broader understanding.

Warwick Sharp of Equivio is a particularly lucid advocate, both of the specific components and workflows in Equivio’s Relevance product and of the wider principles – it is from him that I got the idea that the true test of a technical explanation is whether your mother, having heard it, can explain it back to you. He is one of the speakers on Karl Schieneman’s Predictive Coding and Review Roundtable recorded on 26 March; the others are Jim Wagner, co-founder and CEO of DiscoverReady, and Tom Gricks, head of E-Discovery at the law firm Schnader, Harrison, Segal & Lewis. Both were early converts to, and are convinced users of, predictive coding where that use is appropriate to save their clients’ money without diminishing their arguments, their strategy or their proper conduct of cases.

The core explanation given in the podcast comes from Warwick Sharp – I will summarise it here, but, since the whole point of an ESIBytes broadcast is that it is free and instantly accessible, I recommend that you listen to it yourself.

Warwick’s first point is that lawyers have come late to a technology which has been widely used in other industries since the 1960s. The context is the traditional combination of blunt keyword searching and manual review which, despite its enormous cost and by now well-established inaccuracy, remains the accepted way of establishing relevance or lack of relevance in ever larger volumes of documents. Warwick’s primary summary of predictive coding is straightforward: a human being reviews a subset of documents and technology generalises from that human input across the wider population. The output will be either a categorisation of each document as responsive or not, or a score which reflects the likelihood that it is responsive.

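Equivio’s own workflow, he explains, is based on two key principles. The first is excellent recall and precision, which is not the same as mere “accuracy”. Accuracy is no more than the proportion of documents which are classified correctly; this is no good as a measure because a review which overlooks the one relevant document out of 100 is 99% accurate but gives an unacceptably incorrect answer. Recall (the proportion of the relevant documents which were found) and precision (the proportion of the documents found which are in fact relevant) expose that failure. The second Equivio principle is statistically valid metrics.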
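The 99% example can be worked through in a few lines. What follows is my own minimal sketch in Python, not anything taken from Equivio or from the podcast; the function name `review_metrics` and the made-up population of 100 documents are assumptions for illustration only.

```python
def review_metrics(truth, predicted):
    """Compare predicted relevance calls with the true ones.

    Both arguments are lists of booleans, one entry per document.
    Precision and recall are returned as None when undefined.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # relevant, found
    fp = sum(1 for t, p in pairs if not t and p)      # not relevant, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # relevant, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # not relevant, left out

    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }


# Hypothetical population: 1 relevant document in 100,
# and a "review" which marks every document as not relevant.
truth = [True] + [False] * 99
predicted = [False] * 100

metrics = review_metrics(truth, predicted)
print(metrics)
```

The accuracy comes out at 0.99 and the recall at 0.0, which is the point: the one document that mattered was missed. Precision is undefined here (shown as None) because nothing at all was flagged as relevant.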

All the speakers emphasised the importance of the assessment phase which, in Equivio’s terms, means creating both a control set, used to generate recall and precision metrics, and an experimental or training set. The training stage is terminated when the system has learned as much as it can. You then move to the decision support stage where Equivio’s tools are used to calculate the results. What is critical here is the ratio between reviewed documents and relevance – what percentage of the documents do you need to review to give you what percentage of relevant documents? That gives you the cut-off point beyond which further iterations add nothing to your understanding, and you then move to the verification phase.
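The question "what percentage of the documents do you need to review to give you what percentage of relevant documents?" can be illustrated with a short Python sketch. This is my own simplification and not Equivio's calculation: it assumes that the documents have already been ranked from highest to lowest score, that the true relevance of each one is known (as it would be for a control set), and the function name `review_depth_for_recall` is hypothetical.

```python
def review_depth_for_recall(ranked_labels, target_recall):
    """Return the fraction of the ranked list which must be reviewed
    to find at least `target_recall` of the relevant documents.

    `ranked_labels` holds 1 (relevant) or 0 (not relevant) for each
    document, ordered from highest to lowest relevance score.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the ranked list")

    found = 0
    for reviewed, is_relevant in enumerate(ranked_labels, start=1):
        found += is_relevant
        if found / total_relevant >= target_recall:
            return reviewed / len(ranked_labels)
    return 1.0


# Hypothetical ranked control set of ten documents, four of them relevant.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
depth = review_depth_for_recall(ranked, target_recall=0.75)
print(depth)
```

On this made-up data, reviewing the top 40% of the ranked list finds 75% of the relevant documents. A real exercise would estimate these figures from a statistically valid sample rather than from known answers.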

I spell this out, not simply to promote Equivio (although I am always happy to do that) nor to imply that every software application carrying one of the “predictive coding” labels is the same (they are not) but because this description undermines the glib reaction, generally given by those who do not want to have to think too hard, that predictive coding is a black box with data chucked in at one end and a purported separation of sheep and goats coming out at the other.

All the main providers of predictive coding software have reported an increase in interest since Judge Peck’s Da Silva Moore Opinion was published.  Warwick Sharp specifically mentions an increase in interest in Equivio’s tools which enable the verification of training consistency.

The podcast speakers covered other topics including these:

  • Predictive coding can be used to challenge the relevance criteria of documents provided by opponents; the opening shot might be a polite observation and a request for apparently missing documents; the end result is an application to the court on an informed basis.
  • Seeding with documents deemed to be relevant or to be not relevant can be used to kick-start the process, particularly with document populations of low richness, as long as you remain aware of the risk of the self-fulfilling prophecy which comes from a pre-conception as to what might be found resulting in too narrow a seed set.
  • There was a reminder that predictive coding can be used to complement other search technologies and vice versa – the use of clustering, for example, may validate (or not) the output from a predictive coding exercise, and key words have their place at both ends of the process.
  • There was some discussion about the minimum document population justifying the use of predictive coding. Warwick Sharp suggested that its value really showed from 20,000 documents and upwards, although all agreed that smaller cases would warrant its use – the defensible removal of just 2,000 documents from the review set will save a lot of money.

Good points were made also about the human element:

  • The custodian interview remains a critical component – the technology complements rather than replaces the information to be gleaned from those who created and used the documents.
  • Enthusiasm from some law firm partners will be matched by a concern for their traditional sources of profits on the part of others (this is not to be ignored, although it is reasonable to assume that the market, in the form of increasingly informed clients, will sort that one out soon enough, particularly given the attention given to these cases).
  • Warwick Sharp drew attention to the growing interest in the HR perspective – firms struggle to retain brilliant young lawyers whose ambitions do not include drowning in old-fashioned document review; predictive coding, he said, will take that drudgery away, and lawyers can get on with the things they were trained for.
  • The main brake on mainstream adoption will be education – until people understand what recall, precision, accuracy and richness mean, they will not feel comfortable. The attitude of regulators will be critical here (it is my understanding that many regulators are already using this technology for themselves, so this may not be as much of a battle as people think).
  • Perhaps the most fundamental problem is the inherent unwillingness of lawyers to share information with opponents, and to agree to disclose things to each other. It is not good enough, really, to shrug one’s shoulders and say “Well, you know what lawyers are like, particularly US lawyers”. This aggressive taking of every point costs money, wastes court time and is in breach of specific obligations, in the US, the UK and elsewhere, to co-operate to contain the scale, time and costs of discovery. Predictive coding does not, as a pure technology matter, require the input of opponents, but its value increases immensely if parties are willing to give joint (or at least consensual) input into the training process.

This podcast is valuable for those who are prepared to admit, if only to themselves, that they do not understand the core concepts behind the use of predictive coding. It is valuable too, however, for those who do understand but feel the need for a better articulation of those concepts.


About Chris Dale

I have been an English solicitor since 1980. I run the e-Disclosure Information Project which collects and comments on information about electronic disclosure / eDiscovery and related subjects in the UK, the US, AsiaPac and elsewhere.
