The Panama Papers story gives us a big, high-profile example of how the processing and analytical power of software like Nuix can uncover facts and links which it would otherwise take large teams of humans years to find, if they found them at all. The principles are exactly the same for the much more mundane data management tasks which organisations have to do.
To some extent, the human brain has evolved to meet the ever-increasing volume of information which comes our way. Where I used to sit down and read The Times from cover to cover, I now scan my Twitter timeline quickly, subconsciously looking for keywords which reflect my interests, or for a word or picture to trigger a new interest. Some of it takes me off down avenues which had not entered my head. Some, perhaps, gets missed entirely or is misunderstood. So it was that I flicked past the words “Panama” and “leaks” and mentally pictured large ships on their sides in a long dry cutting; the two words near each other caused me to assume that the Panama Canal had run dry.
The nature of Twitter is that other stories, or copies of the original story, come by pretty fast and I was soon clear that this story was about a different kind of leak. I made another kind of mental leap and assumed that Nuix would be involved in the story somewhere, this story having stirred memories of an earlier one. Others, meanwhile, were making their own sideways jumps as more detail emerged, and in no time at all the Prime Minister of Iceland had resigned and our own PM was being forced to tell us about his personal financial affairs. There will be more to come.
Investigations work like that. You start with a set of core facts and work outwards and sideways from them. You go down blind alleys like my picture of the dry canal. Some facts emerge from the data – that Person A has connections with Corporation B; others come up because someone remembers or knows of a link which others do not; yet more emerges under questioning inspired by the starting point, gets added back into the mix and may trigger yet further searches and discoveries. That is how Sherlock Holmes or Agatha Christie plots work.
Even Holmes would need help with the 11.5 million documents from Panama which fell into the hands of Süddeutsche Zeitung and the International Consortium of Investigative Journalists (ICIJ), which is where Nuix comes into the story. This is not simply because Nuix is particularly good at trawling through very large volumes of data, but because we have seen Nuix involved in a variant on the present Panama leaks story before. That was in 2013, when 260 GB of data were leaked to the ICIJ in 2.5 million files. I refer to it because the description of how it was handled is more comprehensive than anything we have yet seen about the Panama leak.
The story is told in a 2013 article called Intercontinental collaboration: How 86 journalists in 46 countries can work on a single investigation. A quick skim of the data showed that it was about the financial affairs of individuals from more than 170 countries. A team of journalists from around the world was set to work on the data, using software licences donated by Nuix. This distribution of the task was not simply a way of finding manpower. It was appreciated that this kind of investigation needed a mixture of computer processing and analytical power plus the human knowledge of people with special expertise in their regions. Those behind the project were seeking interesting connections and, however good computer software is at sniffing out connections, the human knowledge gathered by investigative journalists is needed to supplement it and to steer the investigation down avenues which are only interesting because somebody, somewhere, knew of a link.
The data was “very messy and unstructured” and it was not enough simply to throw some keywords at it in the hope of uncovering information about people already deemed to be “interesting”. Duncan Campbell of ICIJ is quoted in that article as saying:
Many members started by feeding in lists of names of politicians, tycoons, suspected or convicted fraudsters and the like, hoping that bank accounts and scam plots would just pop out. It was a frustrating road to follow. The data was not like that.
One of the special skills of the Nuix software is to identify “named entities” and link them together. There is a good explanation of named entity extraction in an article called Textual Analytics: Named Entity Extraction by Nuix’s CTO Stephen Stewart.
This is much more complex than simply searching for keywords because it takes you beyond your predefined starting point. Investigators acquire immense power if the software can, for example, point up a telephone number, a postcode or a sum of money which recurs across multiple entries. The mere fact that it recurs may not be important, but it might be; often, only a human will be able to infer a connection deeper than the mere shared telephone number.
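The idea of extracting entities and flagging recurring values can be illustrated with a sketch far cruder than anything Nuix actually does (the patterns, document names and text below are all invented for illustration). A couple of regular expressions pull out candidate entities such as phone numbers and UK postcodes, and a reverse index then shows which values recur across documents — exactly the kind of recurrence a human investigator would then assess:

```python
import re
from collections import defaultdict

# Toy "documents" -- in a real investigation these would be the text
# extracted from emails, PDFs, databases and so on.
documents = {
    "doc1": "Invoice from Mossfon Ltd. Contact +507 263 8899, ref account 4471.",
    "doc2": "Wire transfer arranged by phone +507 263 8899 for beneficial owner.",
    "doc3": "Registered office: 12 High St, London SW1A 1AA. Director unknown.",
}

# Very rough patterns for two kinds of "named entity".
PATTERNS = {
    "phone":    re.compile(r"\+\d{3} \d{3} \d{4}"),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}\b"),
}

# Reverse index: (entity type, value) -> set of documents mentioning it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for label, pattern in PATTERNS.items():
        for value in pattern.findall(text):
            index[(label, value)].add(doc_id)

# A value appearing in more than one document is a candidate link.
links = {entity: docs for entity, docs in index.items() if len(docs) > 1}
for (label, value), docs in links.items():
    print(f"{label} {value!r} links documents: {sorted(docs)}")
```

Here the shared phone number surfaces as a link between doc1 and doc2, while the postcode, appearing only once, does not. Whether that link matters is, as the article says, a question only a human can answer.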
A TechWorld article called The Panama papers – how big data blew the lid on global elite’s financial secrets gives us an introduction to the way Nuix’s data analytics were applied to the Panama papers. The process described in the article turned 2.6 TB of raw data into “something that could be queried to spot deeper connections, patterns and relationships between people, events, locations through time”.
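Once entities have been extracted, "something that could be queried" can be as simple as a graph linking people to the attributes they share. The sketch below (with invented people and attributes, not data from the leak) shows how a few hops through shared attributes reach people not directly connected to the starting point — the "outwards and sideways" movement investigations depend on:

```python
from collections import defaultdict

# Hypothetical extracted facts: (person, shared attribute) pairs, where
# an attribute might be a company, an address or an account number.
facts = [
    ("Person A", "Corporation B"),
    ("Person C", "Corporation B"),
    ("Person C", "PO Box 99, Panama"),
    ("Person D", "PO Box 99, Panama"),
]

# Bipartite adjacency: person -> attributes, attribute -> people.
by_person = defaultdict(set)
by_attr = defaultdict(set)
for person, attr in facts:
    by_person[person].add(attr)
    by_attr[attr].add(person)

def connected(person, max_hops=4):
    """Return people reachable from `person` via chains of shared attributes."""
    seen, frontier = {person}, {person}
    for _ in range(max_hops):
        nxt = set()
        for p in frontier:
            for attr in by_person[p]:
                nxt |= by_attr[attr]
        frontier = nxt - seen
        seen |= frontier
    return seen - {person}

# Person A shares a company with C, who shares an address with D.
print(connected("Person A"))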
The first stories resulting from the leaks are outwardly significant – political leaders of everything from corrupt dictatorships through to apparently worthy democracies have been concealing money; others have motives which vary from corruption through criminal money-laundering down to tax avoidance and the concealment of assets from spouses. Most of the names coming out so far are ones we would expect to find – it is not news that David Cameron’s father had assets in Panama, for example. All investigations have their prime suspects. What is interesting about bringing this level of investigative power to this volume of data is that other and less expected names will start to surface. ICIJ’s Marina Walker Guevara put it this way in the 2013 article referred to above:
Alongside many usual suspects, there were hundreds of thousands of regular people — doctors and dentists from the U.S. It made us understand a system that is a lot more used than what you think. It’s not just people breaking the law or politicians hiding money, but a lot of people who may feel insecure in their own countries. Or hiding money from their spouses. We’re actually writing some stories about divorce.
That is also the nature of many corporate and criminal investigations. Most of them start by trying to supplement a hunch or a suspicion with some hard fact. Where it gets interesting is when the data and its connections take you to places and people not yet on the radar.
It is perhaps some comfort to lawyers to realise the importance of the human component in these searches, where the backbreaking work is done by software like Nuix, leaving the human investigators with new leads to follow up. The Panama papers story is one of international high profile. Exactly the same principles and methodologies apply to more routine investigations over smaller datasets, whether for criminal, regulatory or internal investigation purposes.
In other words, you don’t need 11.5 million documents to justify using Nuix nor, indeed, do you need a crime or a regulatory investigation as the trigger for its use. I took part in a panel with Nuix a few days ago at their London Insider Conference; our subject there was the new General Data Protection Regulation, the GDPR, and the theme which emerged is that organisations have got to get very much better at identifying personally identifiable data among all the other data which they keep. This is because the mere keeping of it brings responsibilities and burdens quite apart from any urgency which may arise for particular purposes such as a discovery demand. The explanation of named entities to which I have linked above would be as useful in that context as in the Panama story.