by A.J. Strollo
“Having lawyers or judges guess as to the effectiveness of certain
search terms is ‘truly to go where angels fear to tread.’”
Magistrate Judge Facciola, United States v. O’Keefe, 537 F. Supp. 2d 14, 24 (D.D.C. 2008)
This statement was made 10 years ago, and the wisdom – particularly when looking at the complexities relating to term syntax and what exists within data sets – has only become more prescient. Search terms can seem fraught, if not outright risky. So why do we continue to rely on them?
Despite the concerns surrounding keywords, and even after all the recent technological gains, they remain the most common way to cull data for potential review and production. The reason for this is likely that they are familiar, and as we all know, the legal community can be slow to move away from the tried and true, particularly when the alternatives involve relinquishing control to machines.
It’s relatively easy to generate a proposed list of terms, run them against the data, and determine how many documents the terms capture. But knowing whether the terms actually capture information of interest is a different story. Along those lines, Magistrate Judge Facciola noted that whether the terms “will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics.” Id.
Facciola may have said this because of the way lawyers often use the search results without substantive analysis. A common practice when running terms is to look at the volume of data that is returned, rather than the quality or effectiveness of the search. So, if the data returned is significantly higher than expected, the lawyer may narrow the terms arbitrarily with the goal of reaching the “right” number of documents. How they determined what is “right” can be a mystery. These adjustments may yield fewer results, but also risk eliminating necessary ones. While that’s not to say that this practice is haphazard, it does lack defensibility, especially if parties are locked in a contentious battle over the scope of discovery.
For me – and I think Facciola would agree — instead of volume, a better focus is on the effectiveness of the terms, measured not solely by number, but on the richness, or “relevancy rate,” of the potential review population.
So how do we make keywords and search terms more effective and assuage the “fears of the angels?”
A big step is to perform substantive analysis of any search terms rather than the commonly used guess and check method. When the starting point is a list of proposed terms from opposing counsel with an uncertain level of effectiveness, we must assess and refine those terms to increase the likelihood of capturing the most relevant documents. Borrowing concepts from basic statistical analysis, the process for vetting terms and suggested revisions can be based on results of a sample review. Terms are modified by targeting common false positive hits — hits on the term but not for the intended target — identified within the non-relevant documents from the sample.
Imagine a fact pattern where the relevant discussions involve Jacob Francis and his interactions with a specific contract. Initial searches for Jacob OR Francis in documents that also contain the contract title or number would yield a substantial volume of documents based on the commonality of Jacob’s name. It’s easy to label this as a bad term, but a lawyer’s analysis is helped much more by understanding why it is bad and how to make it better. Attorneys can do this by looking at the documents, which reveals that there are others at the company with Jacob or Francis in their names (e.g., Jacob Smith or John Francis), thus opening the door to an array of potential term revisions to minimize the number of documents returned. This is a good start, but the analysis does not end there.
Next, it is important to check actual document hits to ensure they are consistent with any assumptions. To do that an attorney should draw and review an additional sample from the documents that were removed from the review population to ensure the new terms are not missing potentially relevant content. Digging into these, the attorney may find out that Jacob Francis had a nickname, “Jake,” which would not be captured using the terms Jacob OR Francis to Jacob w/2 Francis. Continued analysis may also uncover references to the contract negotiation as “Project Apple” instead of the contract title or number.
Using this knowledge and adding or modifying the search to include “Project Apple” and “Jake” addresses these missing documents, avoiding potentially serious omissions. Additional considerations might include running “Project Apple” as a conceptual search rather than as a strict keyword, seeking documents that are similar in meaning but that do not necessarily share the same set of terms.
The payoff of all this work is a more focused set of documents for review, reducing associated costs, and concentrating the review team’s time working on documents in need of review. Considering the alternative of reviewing countless volumes of data unnecessarily, or worse, discarding valuable documents, it’s clear that using keyword searches – effectively – is not only necessary, but beneficial.Tags: Analytics , keyword search