In my last post, Technology Assisted Review: An Acceptable Standard, I went through the case law and evidence to support the use of TAR. This time I am going to run through the mechanics of TAR: how it works and the factors to consider.
I was talking to a Partner in a top tier law firm the other day about the use of TAR and he was saying how he is currently working on a matter with 5 million documents where TAR is being used to perform review. In his firm, they have dedicated support staff who are skilled and knowledgeable in the use of TAR. It is going to become increasingly important to have people involved who truly understand how the TAR process is being applied because at some point in the discovery/disclosure process you may need to explain your actions and how you performed a reasonable and proportionate search for relevant material.
Now that TAR is considered an acceptable standard, I believe we will start to see more questions being asked about how the process was undertaken, and some parties may start requesting to validate their opponents' results themselves. So I hope this blog will go some way to explaining in a little more detail what we are dealing with.
It’s important to recognise where this technology has grown from because this helps us better understand how it works. TAR evolved from concept searching. I remember seeing a demo of concept searching in its infancy several years ago and was blown away.
A paragraph of text was pasted into a search box and, after several minutes of searching, the results were returned containing a selection of documents which related to the contents of the paragraph. The standard method I had been accustomed to was crafting searches using keywords, maybe with wildcards and proximity searches, but this was restrictive due to its precision. This new method allowed you to describe what you were searching for without necessarily having to think up every possible word which could be used.
Another development at this time, working in a similar way, was clustering: the grouping together of similar documents based on their text content. This was very powerful for both prioritising review and checking for outliers after review.
It is this same technology and concept which has now evolved into a workflow known as TAR, which is very powerful when reviewing large volumes of documents.
How is it applied?
Step 1 – Computation
Even though it is often considered “Black Box” technology and the computation may well be guarded in secrecy and protected by patents, the conceptual architecture and workflow are relatively straightforward.
The foundations of TAR are based on the position and frequency of words. This goes something along the lines of:
- Calculate the number of times each word exists in the entire data collection;
- Work out the distance between words;
- Draw a relationship between terms by the number of times they occur together in the same document.
As you can imagine, a massive amount of computation is required for this, so I guess this is where all the smarts of the technology exist: using rules and techniques to reduce the amount of processing required.
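To make the three calculations listed above a little more concrete, here is a minimal sketch in Python. It is not how any particular vendor does it; the three sample documents, the whitespace tokenisation and the names `term_counts`, `min_distance` and `cooccurrence` are all my own assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical three-document collection, invented for illustration.
docs = [
    "the contract was signed by the supplier",
    "the supplier breached the contract",
    "lunch is at noon",
]

# Naive tokenisation: split on whitespace (real tools do far more).
tokenised = [doc.split() for doc in docs]

# 1. Number of times each word occurs in the entire collection.
term_counts = Counter(word for words in tokenised for word in words)

# 2. Smallest distance, in word positions, between two words in one document.
def min_distance(words, a, b):
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None  # at least one of the words is absent
    return min(abs(i - j) for i in pos_a for j in pos_b)

# 3. Number of documents in which each pair of words occurs together.
cooccurrence = Counter()
for words in tokenised:
    for pair in combinations(sorted(set(words)), 2):
        cooccurrence[pair] += 1

print(term_counts["contract"])                           # occurrences overall
print(min_distance(tokenised[0], "contract", "supplier"))  # positions apart
print(cooccurrence[("contract", "supplier")])            # shared documents
```

Even on this toy collection the pair counting grows quickly with vocabulary size, which gives a feel for why the real engines need clever shortcuts.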
The end result of this process is an index, much like a word index, which is used for searching. This index can then be used for multiple purposes such as near-duplicate detection, clustering and concept searching, or combined with other rules-based processes such as email threading and TAR.
The thing to consider here is that this is a purely mathematical process which does not factor in lexicons and language; as such, the results are not influenced by interpretation.
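As a rough illustration of how such an index can drive a concept-style search, the sketch below ranks documents by their similarity to a pasted paragraph. I am assuming plain word-count vectors and cosine similarity as a stand-in; commercial engines use more sophisticated, often proprietary, methods, and the function names and sample documents here are hypothetical.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def concept_search(query, docs):
    """Return ids of documents sharing vocabulary with the query, best first."""
    q = vectorise(query)
    scored = [(cosine(q, vectorise(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

# Hypothetical documents, invented for illustration.
docs = {
    "A": "supplier breached the supply contract",
    "B": "lunch menu for friday",
    "C": "contract signed with new supplier",
}

results = concept_search("dispute over the supplier contract", docs)
print(results)  # document ids ranked by similarity to the pasted paragraph
```

The same similarity measure, applied between documents rather than between a query and documents, is one simple way to think about clustering and near-duplicate detection.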
Step 2 – Workflows
TAR relies heavily on the human element of the workflow, and the workflow needs to be applied in the correct way for it to be effective.
Human Input – The one thing the software can’t do is the thinking, so some human input is needed to allow it to understand what we are looking for. Using the index, the software understands the relationship between words within a document; now it needs some direction in the form of examples of documents which are relevant. Through this learning process it can then use the index to apply the same logic to other documents which have not been manually reviewed.
Now, I’m not going to go into the training rounds in any detail here (it probably deserves another blog post) but as a general overview, there are iterative rounds where documents are reviewed and QA’d.
What I would like to focus on here is the learning process. There are several approaches to this and it is an area where TAR technology is still evolving. There are three recognised methods which can be employed (as defined in the Grossman and Cormack studies): Simple Passive Learning (SPL), Simple Active Learning (SAL) and Continuous Active Learning (CAL).
Simple Passive Learning (SPL) – Documents are selected totally at random both for training and for generating the control set (a control set is a set of documents against which the results are compared to calculate a level of accuracy). Documents which are not adequate for training (such as a one-line email) can simply be ignored.
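To show what comparing results against a control set might look like, here is a small worked example. I am assuming precision and recall as the accuracy measures, which are commonly quoted in this context; the coding decisions below are made up, and `precision_recall` is a hypothetical helper rather than any product's function.

```python
def precision_recall(human, predicted):
    """Compare software predictions with human coding on a control set.

    Both arguments are lists of booleans (True = relevant), in the same
    document order.
    """
    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h and p)        # relevant, and found
    fp = sum(1 for h, p in pairs if not h and p)    # not relevant, but flagged
    fn = sum(1 for h, p in pairs if h and not p)    # relevant, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up control set of eight documents.
human_coding = [True, True, True, False, False, False, True, False]
software_call = [True, True, False, True, False, False, True, False]

precision, recall = precision_recall(human_coding, software_call)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the software found three of the four relevant documents (recall) and three of its four "relevant" calls were correct (precision). Because the control set is a random sample, figures like these can be projected onto the wider collection.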
Simple Active Learning (SAL) – As with SPL, a control set is required. However, with SAL the control set can be selected either using keywords or at random. The training sets are selected using uncertainty sampling – which essentially selects documents which the software is unsure about.
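Selecting the documents the software is unsure about can be sketched very simply. Assuming the software gives each unreviewed document a relevance score between 0 and 1 (an assumption on my part, as are the scores and the `uncertainty_sample` name), the least certain documents are those whose scores sit closest to 0.5.

```python
def uncertainty_sample(scores, batch_size):
    """Pick the documents whose relevance score is closest to 0.5."""
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]

# Made-up relevance scores from a hypothetical model.
scores = {
    "doc1": 0.95,  # confidently relevant
    "doc2": 0.52,  # could go either way
    "doc3": 0.10,  # confidently not relevant
    "doc4": 0.45,  # could go either way
    "doc5": 0.70,
}

next_training_batch = uncertainty_sample(scores, batch_size=2)
print(next_training_batch)  # the two documents the model is least sure about
```

The idea is that a reviewer's decision on a borderline document teaches the software more than a decision on one it has already called confidently.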
Continuous Active Learning (CAL) – No control set is required for CAL; however, the initial training rounds are often referred to as a control set, as they are generally documents which have gone through targeted selection and are likely to contain a high proportion of relevant documents. Training sets are then intelligently selected based on their likelihood of being relevant. This approach means a higher percentage of relevant documents are used for training and therefore should, in theory, reduce the amount of training required.
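The CAL loop described above can be sketched as a toy example. Everything here is an assumption for illustration: the word-weight "model", the simulated reviewer, the sample documents and, in particular, the stopping rule (stop when a batch contains nothing relevant), which is only one of several rules used in practice.

```python
from collections import Counter

def train(coded, texts):
    """Toy model: +1 per word seen in a relevant doc, -1 in a non-relevant one."""
    weights = Counter()
    for doc_id, relevant in coded.items():
        for word in set(texts[doc_id].split()):
            weights[word] += 1 if relevant else -1
    return weights

def score(text, weights):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[w] for w in set(text.split()))

def cal_review(texts, ask_reviewer, seed_ids, batch_size=1):
    """Review the most-likely-relevant documents first, retraining each round."""
    coded = {d: ask_reviewer(d) for d in seed_ids}
    while len(coded) < len(texts):
        weights = train(coded, texts)  # relearn from every decision so far
        remaining = [d for d in texts if d not in coded]
        remaining.sort(key=lambda d: score(texts[d], weights), reverse=True)
        decisions = {d: ask_reviewer(d) for d in remaining[:batch_size]}
        coded.update(decisions)
        if not any(decisions.values()):
            break  # assumed stopping rule: the batch held nothing relevant
    return coded

# Hypothetical collection and a simulated reviewer who knows the answers.
texts = {
    "d1": "supplier breached contract",
    "d2": "contract penalty clause supplier",
    "d3": "penalty notice issued",
    "d4": "lunch menu friday",
    "d5": "friday football results",
    "d6": "office lunch rota",
}
truth = {"d1": True, "d2": True, "d3": True, "d4": False, "d5": False, "d6": False}

coded = cal_review(texts, ask_reviewer=truth.get, seed_ids=["d1"])
print(sorted(coded))        # documents a human actually reviewed
print(sum(coded.values()))  # how many of those were relevant
```

In this run the reviewer sees the relevant documents first and the review stops before every document has been looked at, which is the saving CAL is aiming for.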
I need to point out that these techniques described above are broad and each software provider applies the method in their own unique way.
There is much debate about which is the most effective method, and this could potentially be one of the sticking points in a dispute when it comes to meeting discovery/disclosure obligations. But the reality is that all three are tried and tested methods which work as long as they are applied correctly.
CAL is the most recent adaptation of TAR and arguably the most effective, as it continuously learns from previous coding decisions. The engine becomes more precise as new documents are classified, and the process negates the need for a control set, which some observers see as an overhead. The main drawback of CAL is that, as there is no randomly generated control set, it is not possible to test and verify the results through statistical sampling, and therefore it is not possible to attribute a level of accuracy.
The Right Choice
It is very important for TAR to be applied in the correct way: training decisions need to be consistent, the facts of the matter need to be clear, certain types of documents should be excluded and the results need to be monitored throughout the process. There are a good number of considerations to make even before deciding to use this approach. It’s certainly not the silver bullet for all large-scale document reviews, but it is definitely an approach which should always be considered.
The choice of whether to use SPL, SAL or CAL needs to be based not just on functionality, but also on whether you have the expertise or trusted partners to work with to make it effective.
Sky Discovery consultants regularly work on TAR matters and have been involved in some of the ground-breaking cases. Our consultants are always willing to assist at any stage in the process.
Published by Martin Flavell, Director of Sky Discovery UK Ltd