Why does data disappoint?

Nowadays data is omnipresent, dissected by many in order to understand, to predict and ultimately to act towards a desired outcome. Every company collects (or attempts to collect) data about its products, customers, traffic, etc. Data feeds Artificial Intelligence through Machine Learning, so that the car drives itself or the fridge orders milk even before we run out of it. All companies are striving to use their data, and data science has suddenly become the sexiest job of the 21st century!

However, like everything else, data needs to be used in context and with caution. One needs to read through and thoroughly understand the user manual before unleashing its power; otherwise it might lead to disastrous predictions (as they say, “When nothing else has worked, read the user manual!”). And this user manual is nothing but a good book on Statistics!

A sound understanding of sampling methods, bias and error handling is mandatory to avoid ending up with utterly unproductive results. No machine learning algorithm will act as intended if the data source is not understood (some will say “clean”, but what we really want is unbiased). Likewise, when basing a decision on data, using p-values without properly understanding the concepts of null hypothesis and error rate, or without a proper replication test, will probably do no better than a random choice!
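To make the notion of error rate concrete, here is a minimal sketch that runs many simulated experiments in which the null hypothesis is true by construction (both groups come from the same distribution) and counts how often p < 0.05 still shows up. Everything in it is an assumption made for illustration: the known standard deviation, the group size, the seed and the helper name `two_sided_p_value` are not taken from any real study.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sample z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(1)  # fixed seed so the run is repeatable
n_experiments, n_per_group, alpha = 1000, 50, 0.05

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: there is no real effect.
    group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    if two_sided_p_value(group_a, group_b) < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"'Significant' results with no real effect: {false_positive_rate:.1%}")
```

The share of "significant" results should sit close to the chosen threshold of 5%. In other words, a single p-value below 0.05 without replication is exactly what chance alone delivers about one time in twenty.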

Extremely widespread techniques can suffer greatly from a lack of proper understanding of data sampling, of biases and variances.

A clear example is A/B and multivariate testing of different versions of the same product. Who has not heard someone shouting across the room, “I do not believe in A/B testing!”? If you run an A/B test on unrepresentative and ridiculously small samples (that are sometimes also highly correlated), then your statistic will be biased and the variance will dominate the analysis. Thereafter, when scaling up (often at huge cost), the product selected through the test will be as good as a random choice between several possibilities, leading to a result that might even be inferior to its predecessor! In fact, a bad implementation of this technique can defeat its whole purpose.
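The effect of sample size can be illustrated with a small simulation. The sketch below assumes two hypothetical product versions with true conversion rates of 10% (A) and 12% (B), repeats the same A/B test many times, and counts how often the test picks the genuinely better version B. The rates, the sample sizes, the seed and the function names are all invented for the example; it also only covers the small-sample problem, not unrepresentative or correlated samples.

```python
import random


def conversions(rng, n_visitors, true_rate):
    """Count conversions among n_visitors simulated with the given true rate."""
    return sum(rng.random() < true_rate for _ in range(n_visitors))


def share_of_tests_picking_b(n_per_arm, rate_a=0.10, rate_b=0.12,
                             trials=200, seed=0):
    """Fraction of repeated A/B tests in which B shows strictly more conversions."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins_for_b = 0
    for _ in range(trials):
        a = conversions(rng, n_per_arm, rate_a)
        b = conversions(rng, n_per_arm, rate_b)
        if b > a:
            wins_for_b += 1
    return wins_for_b / trials


small = share_of_tests_picking_b(n_per_arm=30)
large = share_of_tests_picking_b(n_per_arm=5000)
print(f"30 visitors per version:   B selected in {small:.0%} of tests")
print(f"5000 visitors per version: B selected in {large:.0%} of tests")
```

With a few dozen visitors per version the better product should be selected only about as often as a coin flip would select it, while the larger sample should identify it almost every time.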

Averaging (or smoothing) is also a tool to be used with prudence, especially when dealing with time series: averaging over large time frames will strip out all useful trends and mask any outcome of consequence from the analysis. For instance, when validating measurements of KPIs taken over time with different tools or methods, the intrinsic variance of an average over a long period (let's say 6 months) will prevent any constructive comparison between the two measurements (e.g. “the results agree within 5% … but with a variance of >200%!”). Switching tools will then result in totally discontinuous measurements, and impair any decision taken on smaller time frames (a few weeks to a couple of months).
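A toy illustration of how a long average can hide a disagreement: the sketch below builds two made-up weekly KPI series over 26 weeks, one roughly flat and one drifting steadily, and compares them first on the 6-month average and then on 4-week windows. The series, the window length and the variable names are assumptions chosen for the example, not real measurements.

```python
from statistics import mean

weeks = range(26)  # roughly 6 months of weekly measurements

# Hypothetical tool A reports a nearly flat KPI (alternating 100 and 102).
tool_a = [100.0 + 2.0 * (week % 2) for week in weeks]
# Hypothetical tool B drifts steadily from 60 up to 140 over the same period.
tool_b = [60.0 + 80.0 * week / 25 for week in weeks]

# Comparison on the 6-month average: relative gap between the two means.
overall_gap = abs(mean(tool_a) - mean(tool_b)) / mean(tool_a)

# Comparison on 4-week windows (the last window only holds 2 weeks).
window = 4
window_gaps = []
for start in range(0, len(tool_a), window):
    mean_a = mean(tool_a[start:start + window])
    mean_b = mean(tool_b[start:start + window])
    window_gaps.append(abs(mean_a - mean_b) / mean_a)

print(f"Gap between 6-month averages: {overall_gap:.1%}")
print(f"Largest gap on a 4-week window: {max(window_gaps):.1%}")
```

The two tools agree comfortably within 5% on the 6-month average, yet disagree by more than 30% on some 4-week windows, which is the time scale on which decisions are actually taken.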

While there is currently a strong push for data analytics to become highly predictive (or even prescriptive), abuse of statistics is destroying the ability of data to forecast the future. The end result, meant to be a diagnosis on which the business can base its treatment, often turns out to be a post-mortem report!

Data does not disappoint, unwise use of statistics does!
