Using the genalg Package in R to Optimize Legal Costs. This is a moderately advanced lesson from our Data Science for Lawyers Workshop. We would appreciate any feedback from you.
An introduction to the Correlation Coefficient in Legal Data, particularly in Data sets with Legal events and permutations expressed as coherent numerals.
We provide Basic to Advanced Interactive Workshops in:
• Data Analytics
• Machine Learning Algorithms & Artificial Intelligence
• What Data Science is and what its impact is & will be on Legal Services
• How to adapt your practice to an increasingly Data-Driven world
• How to use Data Science to boost Productivity, Efficiency, Speed and Cut Costs
Our Workshops include but are not limited to:
• Interactive demonstrations using in-house Company Data
• Professionals participating in analyzing their own Company Data to unearth insights
• Learning how to build Statistical Models that Predict Outcomes and Classify eventualities
• Core Data Science Techniques being taught to participants, varying from Basic to Advanced levels
• Learning how to apply these Data Science techniques to enhance everyday Legal Processes
I became a Data Science practitioner in probably the most peculiar, counter-intuitive way possible. Before being confronted by the embroidered Python or R dilemma, before being exposed to Data Analytics Platforms, Math and Statistical theory, I concerned myself primarily with what I can best term “Data Transmutation”. I intuitively knew that before I could do Data Analytics, I had to somehow transmute the Data. I would primarily be mining Legal Data, and its unstructured nature and textual rigidity would compel me to develop a way of modifying it into a form that is Analytics-receptive. This was before I even knew what ETL was.
I soon began quantifying Legal permutations and expressing them as mathematically weighted numerals for the purposes of efficient Data Mining. This method is designed specifically for Machine Learning Algorithms that require numerical attributes and weights for their calculations. I realized, however, that beyond mathematically quantifying Legal Data, the method was in fact altering the Data completely, and (to my surprise) not just Legal Data. That is how Data Transmutation, as I understand it, was born.
Data Transmutation creates a synthesis of multiple events that are first encapsulated into a single expression, then transposed into a math function and finally transmuted into a coherent data point. It condenses Data into mathematically calculable numerals and symbolic expressions that weight the factual permutations of events and occurrences. The results are highly potent Data points; it’s like condensing Lite Beer into a liquid with an alcohol content of 100%. This distillation of Large Data sets into highly concentrated but rational Data points is very helpful.
Data Transmutation is a bit like gene extraction. All data tells a story (some stories more exciting than others) and every story has a core phenotypical structure and genome. Data Transmutation is a way of delineating the Data into “DNA” strands that represent the foundational archetype of the story the Data is telling. Transmutations mine the primary consequence of events and occurrences by isolating the systemic functions of data, thereby extracting only salient truths.
Think of the process of Diffusion in Biology, in which a substance moves from an area of very high concentration to one of low concentration as it expands and occupies larger spaces. Through Diffusion, a gas loses its potency and efficacy as it begins to spread, which is very similar to what happens during the collection and architecture of Data. When you Extract, Transform and Load Data, you’re essentially taking a series of events and fragmenting them into scalable features for the purposes of Algorithmic enquiry. This fragmentation increases density, widens factual parameters, increases variation and ultimately “diffuses” the efficacy of the story. However, when Data is transmuted into a condensed form, factual parameters aren’t unnecessarily expanded, the features become more salient, the density remains the same and variation is kept at a healthy level.
There are of course Sampling, Feature Optimisation, Feature Generation (combinational vector creation) and many other tools, all of which are used to distil data into a state of optimum lucidity. However, there is a difference between Transmutation and Segmentation, which is essentially what the above tools perform. They minimize and optimally abridge the Data to be analysed; they do not fundamentally mutate it.
Discretization methods are ubiquitous on all good Analytic platforms, and they are probably the closest you can come to changing the aesthetic identity of data points without using Data Transmutation. You could certainly use Discretization to convert numerical attributes (where some entries are “0”) into binary attributes detailing “Yes” or “No”. This, however, cannot be done without compromising the structural and probative integrity of those data points.
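To make the Discretization point concrete, here is a minimal Python sketch of converting a numerical attribute into a “Yes”/“No” attribute. The function name `to_yes_no` and the postponement counts are hypothetical and chosen purely for illustration; this is ordinary binary discretization, not Data Transmutation.

```python
def to_yes_no(values):
    """Map each numeric entry to "Yes" if it is non-zero, otherwise "No"."""
    return ["Yes" if v != 0 else "No" for v in values]


# Hypothetical attribute: number of postponements recorded per matter.
postponements = [0, 3, 0, 12, 1]
labels = to_yes_no(postponements)
print(labels)  # ['No', 'Yes', 'No', 'Yes', 'Yes']

# Count how many distinct values survive the conversion.
distinct_before = len(set(postponements))
distinct_after = len(set(labels))
print(distinct_before, "distinct values before,", distinct_after, "after")
```

After the conversion, a matter with 12 postponements is indistinguishable from a matter with 1, which is one way of seeing the loss of integrity described above.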
Nevertheless, the definitive feature of Data Transmutation is the ability to mathematically calculate the value of a transmuted data point without using Analytics to do so. Consider a classification model for Gold, for example. One of the data points under the attribute “Pressure and Temperature Data” has been Transmuted from 27.0 GPa of Shear Modulus (original data point) into a mathematically weighted expression of (P+) 4.833 (Transmutation Value). Because the data point has been delineated into a math function, it is possible to calculate the mathematically representative value of Shear Modulus, in shorthand form, without using any code or software: with pen and pad. Think of Machine Learning Algorithms that have the ability to produce Formulas for their results or, at a very abstract level, even MapReduce; Data Transmutation works in a similar way.
I am not saying that this method of altering data is a divine panacea. Just like a Machine Learning Algorithm, it has conditions and parameters that must be satisfied for it to perform optimally. All I am saying is that there is a way of changing data to facilitate a more advanced method of Machine Learning. Unfortunately, Transmutation will inevitably lengthen the already protracted ETL process; however, the rewards are bountiful. As Data Science practitioners we should let go of the fear of “corrupting” data. Change it as you see fit and you may be pleasantly surprised by the results.
Sometimes I don’t trust Data Science, probably because my duty of care is more pronounced on account of working mostly in Legal Analytics. You see, as an Analytics Practitioner in the Legal field, my Data Science methodology cannot afford to yield wild guesses; these are people’s lives I’m dealing with. You have to be very careful with Legal Automation: if you build a Classification model for a State Prosecutor and you miss something, even a very small thing, the results will be cataclysmic. For Analytic practitioners in Law it’s not just about finding subtle or abstract insights that boost efficiency; you are venturing into the innermost sanctum of human life and its consequences. This is not a whimsical Analytic project of the kind some companies venture into because of all the hype around Data Analytics. You know the type: not really knowing what they want out of Analytics, but hoping that they’ll know it when they find it in that elusive golden nugget their vast data holds.
At one of the major Banks the minimum desired accuracy of a Classification Model is 85%. That is the standard for them and many Data Scientists as well. That level of accuracy is terrific in every other field, except Law. In my line of work it would be a mistake to take a gamble on a model that has a 15% probabilistic margin for error; its probative value is simply inadequate.
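As a rough illustration of what a 15% margin for error means in practice, the Python sketch below converts an accuracy figure into an expected number of misclassified cases. The helper `expected_errors` and the caseload of 200 matters are hypothetical, and the arithmetic simply assumes the error rate applies evenly across all cases.

```python
def expected_errors(accuracy, n_cases):
    """Expected number of misclassified cases for a model with the given accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # The error rate is the complement of accuracy; round to whole cases.
    return round((1.0 - accuracy) * n_cases)


# Hypothetical caseload of 200 matters scored by a model with 85% accuracy.
print(expected_errors(0.85, 200))  # 30
```

On that assumption, roughly 30 of the 200 matters would be classified wrongly, and in a legal setting each of those is a person.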
The statistical volatility of Data Science sometimes requires other instruments to supplement an Analytic process; for me that supplement is Math. Mathematics is a necessity in our Analytics practice rather than a peripheral and elective tool, which is unheard of for Lawyers, but it’s true. I simply cannot rely on Computational Algorithms alone; if I did, it would diminish the veracity of the results of my Analytic projects. This led me to develop a series of fairly elaborate Equations and Formulas that we solve before and after the Analytic process. These math functions have enabled a breed of “Guerrilla Analytics” that has become a staple for us. They are applied to a client’s data, and the results of those calculations are the values that make up a typical Data set for us. So while most Analytic practitioners will clean and architect structured and unstructured data, then ultimately model it while keeping the data values mostly as is, we employ a different approach. By the time we are done with our ETL process, the Data will be unrecognisable; this is because the Equations we use transmute the Data into a form that only our math functions can recognize and rationalize. For example, if we take a Data Set of quarterly revenue and a particular entry is $40000, after our math calculations it will no longer be $40000 but something like “(P+) 7.33”, and that is not meant to denote its “weight” in Data Science terms either. If it is raw Legal Data, a particular averment in one of our client’s Pleadings could be illustrated as “(N-) 0.333”, an answer our formulas arrived at. This is a painstaking but worthwhile process, and the calculations will a lot of the time be done by hand (I’m old school like that). Other times they will take the form of an equation in a Matrix, again worked by hand and then transformed into a computation thereafter.
One area that has benefited remarkably from the math equations is our Trial Simulations. Simulating a Legal Trial using Algorithms is an enormously difficult and complex task, one that you simply cannot embark on competently using Traditional Data Science tools alone. Postponements, the introduction of new Evidence, the uncovering of new facts and cross-examinations are all factors that can single-handedly derail any Analytic Model on any Data Science platform you can think of. Surprises like these are just far beyond any Parameter adjustment or Machine Boosting technique. This is especially the case when practicing real-time Litigation Analytics in an actual Trial. If something happens unexpectedly, you need a shorthand technique to quantify those sorts of permutations right then and there, ergo, summarily deploying Analytics in the quickest way possible. Unfortunately, in a situation like this there is no time for an ETL process, Data Cleansing or Architecture; this is Guerrilla Analytics, and the math functions we’ve developed make it happen.
Now I know that I face imminent attack by Data Science purists when I say that sometimes I don’t trust Data Science on its own, but Machine Learners have their own peculiar biases and dispositions, some of which can jeopardize Legal Analytics. I would never use a Support Vector Machine alone if I’m building a Predictive Model that informs the decision of a country’s Attorney General to prosecute a citizen or not. In an instance such as this one, it would be absolutely criminal (maybe even literally) to pursue Data Science recklessly or without some sort of supplementary tool.
Data Science forms the very substratum of an Analytics Practitioner’s work; it’s what sets us apart from Statisticians or Mathematicians. However, in some instances we cannot rely on it alone, and we need to employ other measures to increase its definitiveness. In any event, I am sure many Data Scientists use math and other means to augment the potency of their Analytics, some of them not even scientific at all. It is undeniably prudent to do so where necessary, especially in fields that demand a higher standard of accuracy and care.