Machine learning to anticipate eDiscovery not just to manage it

Jim Shook of EMC takes us back to the stage before discovery. The advanced technology used for dealing reactively with discovery requests has its place at a much earlier stage in the process.

Judge Peck’s opinion in Da Silva Moore passes into a kind of limbo pending its review by Federal Judge Carter. The analysis of the present position has been exhaustive and, to some extent, repetitive, and those of us who comment on these things have little more to say until Judge Carter does his stuff. We are waiting, too, for the next step in the Kleen Products case before Judge Nolan. It is a bit like one of those uneasy patches on the French battlefields of the Great War as everyone waited for the whistle signalling the next big push.

It is a good opportunity, perhaps, to look in a more rounded way at the broad class of technology which, whether you call it predictive coding, technology-assisted review, machine learning, or whatever, connotes generally the idea that computers learn from a mixture of rules and previous inputs in order to “predict” what should be done with documents, classes of documents or, perhaps, whole servers full of documents. The technology being developed for this, and for similar functions which have nothing to do with discovery, has many of the same characteristics and objectives as the pure discovery applications. Marketing intelligence, news sites which point you to related articles, shopping sites which suggest alternative purchases and (as Judge Peck noted) anti-virus software, all include elements of this kind of prediction.
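For readers who want to see the idea of learning from previous inputs in concrete form, here is a minimal sketch in Python. It is a toy naive Bayes text classifier, chosen only because it is short; it is not the method used by any product or case mentioned here. The sample documents, the labels "relevant" and "irrelevant", and the function names are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs coded by a reviewer."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes score for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Start from the share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing so an unseen word/label pair is not fatal.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical "seed set": documents already coded by a human reviewer.
seed_set = [
    ("bonus pay review for female staff", "relevant"),
    ("promotion and pay decision memo", "relevant"),
    ("lunch menu for friday", "irrelevant"),
    ("office car park closed friday", "irrelevant"),
]

model = train(seed_set)
print(predict(model, "memo about pay and promotion"))
print(predict(model, "friday lunch in the car park"))
```

Real systems use far larger training sets, richer features and statistical validation of the results; the point of the sketch is only that the "prediction" is derived from earlier human decisions.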

Jim Shook – James D Shook, Esq, to give him his full title – is an eDiscovery expert at EMC. EMC was one of the first information companies to give effect to the now obvious conclusion that discovery can be seen as an end-use for the data which companies keep in ever-growing volumes anyway. Its roots are in archiving and storage, and the intelligent enterprise content management (ECM) found in its Documentum and related products. It acquired Kazeon in order to give its ECM customers a seamless run from document creation through to eDiscovery, the latter coming from its EMC SourceOne eDiscovery Family. Its website page EMC SourceOne File Intelligence has the subheading Informational Growth, Storage Requirements, and Organisation Risk and sits in a menu which extends backwards into the data centre and forward into legal hold. It is this breadth which gives Jim Shook the conclusion to his article Machine Learning for Document Review: the Numbers Don’t Lie.

Most of that article is an analysis of the use of predictive coding for discovery in cases like Da Silva Moore.  Like many of these articles, it refers to the paper Technology-Assisted Review in eDiscovery can be More Effective and More Efficient than Exhaustive Manual Review by Maura Grossman and Gordon Cormack. Helpfully, it gives page references in that paper for the particular points which Jim Shook wants to make.

It is the last paragraph of Jim’s article which I want to focus on, with its suggestion that “predictive coding technologies show promise outside of the litigation process to help with our information management overload issues”. The technology already exists, Jim says, to apply automatic classification to information as it is received or created, and “improvements, higher comfort level and better understanding of the technologies caused by their use in litigation will help with the adoption rate”.

This is a tangible illustration of what we mean by “information governance” or, at least, of a subset of that expression. In addition to efficiency gains and reduced storage costs, it implies that the information we keep can be limited right from the beginning to that which we are likely to need, whether for business purposes or for eDiscovery.

Of the 3.2 million documents which are the Da Silva Moore starting point, only a fraction will prove of value to either party in this litigation, and most will never have served any useful purpose since shortly after their creation. The simple maths which Jim Shook sets out in his article – of documents for review multiplied by the review time per document – is clear enough and requires (rather than merely justifies) the use of all available tools to reduce the review load.
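The multiplication is easy to sketch. In the Python snippet below, only the 3.2 million document count comes from the case; the review time per document, the hourly rate and the proportion left after culling are placeholder assumptions of mine for illustration, not figures from Jim Shook’s article.

```python
def review_estimate(documents, seconds_per_document, hourly_rate):
    """Return (hours, cost) for reviewing every document one by one."""
    hours = documents * seconds_per_document / 3600
    return hours, hours * hourly_rate


DOCUMENTS = 3_200_000        # the Da Silva Moore starting point
SECONDS_PER_DOCUMENT = 60    # assumption: one minute per document
HOURLY_RATE = 50.0           # assumption: reviewer cost per hour
FRACTION_AFTER_CULLING = 0.10  # assumption: share left for human review

full_hours, full_cost = review_estimate(
    DOCUMENTS, SECONDS_PER_DOCUMENT, HOURLY_RATE
)
culled_hours, culled_cost = review_estimate(
    int(DOCUMENTS * FRACTION_AFTER_CULLING), SECONDS_PER_DOCUMENT, HOURLY_RATE
)

print(f"Full review:   {full_hours:,.0f} hours, cost {full_cost:,.0f}")
print(f"After culling: {culled_hours:,.0f} hours, cost {culled_cost:,.0f}")
```

Whatever numbers you put in, the result scales in direct proportion to the document count, which is why reducing the volume put in front of human reviewers matters more than any other variable.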

The cost incurred defensively for a one-off purpose which has nothing to do with the company’s business must be taken into account when considering an investment in the sort of technology – intelligent pre-emptive technology – which Jim Shook refers to in his last paragraph.


About Chris Dale

I have been an English solicitor since 1980. I run the e-Disclosure Information Project which collects and comments on information about electronic disclosure / eDiscovery and related subjects in the UK, the US, AsiaPac and elsewhere.

1 Response to Machine learning to anticipate eDiscovery not just to manage it

  1. pegduncan says:

    Regarding the information governance dilemma, there’s an intriguing statistic floating around that the majority of data in large environments (as much as 99%) is stored but never accessed again, because of the desire to retain data in a long-term archive for compliance purposes. Variants of the figure crop up frequently, but it has been difficult to track down the original research behind it. However, there is one research paper, Measurement and Analysis of Large-Scale Network File System Workloads, by Andrew W. Leung, Shankar Pasupathy, Garth Goodson and Ethan L. Miller, 2008, available from the University of California, Santa Cruz (retrieved March 20, 2012), that found:

    6. Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times.
    7. File re-opens are temporally related. Over 60% of re-opens occur within a minute of the first.
    8. A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests.
    9. Files are infrequently shared by more than one client. Over 76% of files are never opened by more than one client.
    10. File sharing is rarely concurrent and sharing is usually read-only. Only 5% of files opened by multiple clients are concurrent and 90% of sharing is read-only.

    Our data are definitely cooling.
