As the name implies, the world of data science is, by definition, dependent on data, and that is certainly the case when it comes to machine learning modelling in the context of eDiscovery. Be it supervised or unsupervised learning, the machine has to cut its teeth on relevant, rich data if it is to build something that reliably delivers accurate results and so delivers on the objective of accelerating the EDRM process.
The potential for AI technologies to wade through vast volumes of electronic documentation and communications is increasingly being recognised. Still, the reality is that unless pre-built models exist that were trained on relevant content, we are faced with building afresh each time, which rather nullifies the advantage of letting the machine do the heavy lifting.
Terminology in property law, for example, varies considerably across the world, and whilst its meaning may well be largely the same, a model built from content in one jurisdiction may fail miserably in another. The same will probably hold true for industry-specific terminology, which is not readily transferable, and of course the challenge is compounded across multiple languages or where regional variations or lexicons exist.
The Opportunity for the Trusted Advisor
With this in mind and against a growing backdrop of privacy and confidentiality, finding publicly available, representative training data sets for anything can be a problem. After all, when did you last sit through a demonstration of an eDiscovery technology that didn’t use the aged Enron data set?
This is where the law firm can add value.
Law firms, in their position as trusted advisors, often have access to client content within their document management estate, or indeed as part of an eDiscovery or litigation exercise they are conducting. And it is this access that provides the key to the problem described above.
Computers don’t do language. They do numbers. Think, therefore, of building an AI model as the process of converting language (with contextual insight) into corresponding mathematical constructs. Be it supervised (i.e. with human intervention) or unsupervised (algorithmically looking for previously undetected patterns), in simple terms the resultant model then decides which side of a mathematical divide a new example falls on, along with a measure of confidence. By that stage, any client confidential materials that contributed to building the model are totally obfuscated, represented only by a mathematical vector.
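To make the idea concrete, here is a minimal, purely illustrative Python sketch of that conversion: reviewer-tagged text is turned into count vectors, the vectors for each tag are averaged into a class "centroid", and a new document is scored against each centroid with a confidence measure. The snippets, vocabulary handling and centroid/cosine approach are assumptions for the example only, not any vendor's actual implementation.

```python
from collections import Counter
import math

def vectorise(text, vocab):
    """Convert language into numbers: a fixed-length word-count vector."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Similarity between two vectors (1.0 = same direction, 0.0 = unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average the vectors for one class into a single representative vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical reviewer-tagged snippets (the supervised, human-labelled input).
relevant = ["breach of lease covenant", "landlord served notice on tenant"]
irrelevant = ["lunch menu for friday", "office party friday menu"]

vocab = sorted({w for doc in relevant + irrelevant for w in doc.lower().split()})
rel_centroid = centroid([vectorise(d, vocab) for d in relevant])
irr_centroid = centroid([vectorise(d, vocab) for d in irrelevant])

def classify(text):
    """Decide which side of the mathematical divide a new example falls on."""
    v = vectorise(text, vocab)
    s_rel, s_irr = cosine(v, rel_centroid), cosine(v, irr_centroid)
    total = s_rel + s_irr
    confidence = s_rel / total if total else 0.5
    return ("relevant" if s_rel >= s_irr else "irrelevant"), confidence

label, conf = classify("notice served on the tenant for breach")
```

Note that the centroids contain only averaged numbers: the client documents that produced them cannot be read back out, which is the obfuscation the paragraph above describes.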
Building a Solution
By adopting the latest technologies, and through the normal process of eDiscovery analysis and review, the law firm can build machine-learned models, either for generic use or specific to its areas of practice, industries and jurisdictions. These models are transferable and can be re-used on later matters. Not only can this accelerate subsequent reviews, it also overcomes the problem of having to re-build models for every engagement, strengthening the added value that the firm can deliver.
Vendors can and do provide “out of the box” models, and it may prove advantageous to use these as a quick start to seed your own. For example, a model trained to detect bullying behaviour may be sufficiently generic to identify some relevant content in a data set, but local language variations between, say, American English and British English may mean it is not as effective as you might want. By building on top of that existing model through the provision of positive and negative feedback, simply as a function of your review process, a more pertinent model can evolve and be saved.
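The feedback loop above can be sketched as follows, again as a toy illustration rather than any vendor's API: a simple linear model starts from seed weights (standing in for an “out of the box” model) and each positive or negative reviewer tag nudges those weights when the model disagrees with the reviewer, a perceptron-style update. The snippets and vocabulary are hypothetical.

```python
def bow(text, vocab):
    """Bag-of-words: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

class SeedModel:
    """Linear scorer seeded with starting weights and refined by reviewer feedback."""
    def __init__(self, vocab, weights=None):
        self.vocab = vocab
        self.w = weights or [0.0] * len(vocab)  # zeros stand in for a seed model

    def score(self, text):
        """Positive score = relevant, negative or zero = irrelevant."""
        return sum(wi * xi for wi, xi in zip(self.w, bow(text, self.vocab)))

    def feedback(self, text, is_relevant, lr=0.5):
        """A reviewer tag adjusts the weights only when the model disagrees."""
        target = 1.0 if is_relevant else -1.0
        predicted = 1.0 if self.score(text) > 0 else -1.0
        if predicted != target:
            x = bow(text, self.vocab)
            self.w = [wi + lr * target * xi for wi, xi in zip(self.w, x)]

# Hypothetical snippets tagged during a normal review pass.
positives = ["you are a bully", "stop threatening me"]
negatives = ["quarterly sales report", "meeting agenda attached"]
vocab = sorted({w for doc in positives + negatives for w in doc.lower().split()})

model = SeedModel(vocab)
for _ in range(5):  # each review pass feeds back more tags
    for doc in positives:
        model.feedback(doc, True)
    for doc in negatives:
        model.feedback(doc, False)
```

After review, the refined weights in `model.w` are the saved artefact: a transferable model that can be re-applied on a later matter without touching the original client content.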
If you’d like to find out more about how Salient Discovery can help you with building machine-learned models for eDiscovery and Cognitive Analytics purposes, contact us here.