As any analyst knows, your reputation is more valuable than any insight you offer. One of the great pitfalls of big-data analytics is spurious correlation. As the name implies, spurious correlations are cases where the data seems to indicate a link between two or more occurrences, but where there is no causal link between them. For example, if the critical variable is correlated with, say, age, then other variables that correlate with the critical variable will often appear to correlate with age as well. Because these secondary correlations arise only through the critical variable, and not from any direct link, they are spurious.
Let’s try another example. First, amass all the data on car accidents in the last three years, and the damages, in dollars, caused in those car accidents; then correlate that to the number of police vehicles at each accident. Notice the correlation between the number of police vehicles and the higher-cost damages. Now, you might infer that police vehicles, in higher quantities, cause higher-damage accidents. This is a spurious correlation, since it misses the far more likely scenario, where higher-damage accidents require the summoning of more police vehicles than lower-damage accidents. This is the danger of spurious correlation: bad insights and incorrect conclusions.
There is no easy fix, as you can’t just avoid using correlations: their use leads directly to insights and cannot be avoided. Luckily, there are ways to mitigate the risk of relying on spurious correlations and to avoid acting on dangerous insights. To reduce the probability that our insights are nothing more than the result of chance, we must break out of our comfort zones and test not just the data itself, but our perceptions of the data as well. There are some strategies I have taught my team to help deal with real-world problems so that we can be confident that the insights we give are impactful. Even better, these strategies can be applied to all types of analytics, whether in finance, A/B testing, marketing or even big pharma. Most of all, these strategies are going to throw everything, including the kitchen sink, at your correlations and insights.
One of the first things you recognize in our car accident example above is how ridiculous the insight was. Too often as analysts we do not take the time necessary to examine the output when a correlation is found. Unchallenged correlations and insights (“darlings,” to borrow a writers’ term) are dangerous and can lead to loss of customers, profit and/or reputation. Step one in improving your insights and avoiding spurious correlations is to stop and look at the correlation itself. Does it make sense? Are there other correlations in the data that make more sense and could explain why this correlation occurred?
You Can’t Analyze Data in Your Pajamas
It is easy to get comfortable, particularly when you look at the same metrics day in and day out. Often you use the same test for correlation as well. It’s like spending all day Sunday in your pajamas; it’s comfortable, but not very productive. One effective way to avoid spurious correlations is to use more than one test to determine significance. Try a chi-squared test, a t-test, a test of averages and/or any other test that may make sense with your data sets. What do your results look like when compared in these different ways? Does your correlation still hold up?
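As a rough illustration of running more than one check on the same pair of variables, here is a minimal Python sketch using only the standard library. The function names (pearson_r, spearman_rho, chi_squared_median_split) and the toy numbers are assumptions made up for this example, not anything from a real data set. Pearson and Spearman are used here as two differently behaved correlation measures, and the chi-squared statistic is computed on a simple above/below-median split; in practice a statistics library such as SciPy would be the more usual choice.

```python
import math
from statistics import median


def pearson_r(xs, ys):
    """Pearson correlation: picks up linear trends, but is easily swayed by outliers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson applied to ranks, so outliers count less."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def chi_squared_median_split(xs, ys):
    """Chi-squared statistic (1 degree of freedom) on an above/below-median 2x2 table."""
    mid_x, mid_y = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        high_x, high_y = x > mid_x, y > mid_y
        if high_x and high_y:
            a += 1
        elif high_x:
            b += 1
        elif high_y:
            c += 1
        else:
            d += 1
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # one row or column is empty, so there is nothing to test
    return (a + b + c + d) * (a * d - b * c) ** 2 / denominator


# Made-up toy data: nine unrelated points plus one extreme point in both series.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
ys = [5, 3, 6, 2, 7, 1, 8, 4, 9, 100]

print("Pearson r:   ", round(pearson_r(xs, ys), 3))
print("Spearman rho:", round(spearman_rho(xs, ys), 3))
# 3.841 is the 5% critical value for a chi-squared test with 1 degree of freedom.
print("Chi-squared: ", round(chi_squared_median_split(xs, ys), 3), "(critical value 3.841)")
```

In this toy case the single extreme point makes the Pearson figure look close to perfect, while the rank-based and median-split checks are far less impressed. Disagreement of that kind is the signal to stop and examine the correlation before passing it on.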
Bigger Isn’t Always Better
In a world where “big data” is a daily buzzword, all too often huge data sets are used to find answers and give direction. Using Google’s search results, we can see the effects of big data on correlation. In the chart below, we see the correlation of car accident searches to book reviews. This data set is monthly searches from 2004-2015. Try to imagine just how many Google searches were made in that 11-year period. We could draw the conclusion (from the 0.94 correlation using a simple T-Test) that an uptick in car accident searches would lead to an uptick in book review searches. (Data Source: Google Correlate)
If we instead restrict the sample to just 2015, the correlation drops to 0.46, which is far weaker and no longer notable. Rationally, we know these two things are unlikely to be related.
Beyond narrowing your date range, you can examine other slices of the data to verify your results. Try different segments, and then see where the correlation breaks down.
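The slicing idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two series are synthetic (they are not the Google search data discussed above), and correlation_by_slice is a hypothetical helper name. Both series share an upward trend over 144 months, but their month-to-month movements are unrelated, which is one common way a long date range can produce a high correlation that does not survive inside shorter slices.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_slice(xs, ys, slice_len):
    """Correlate each consecutive, non-overlapping slice of slice_len points."""
    results = []
    for start in range(0, len(xs) - slice_len + 1, slice_len):
        stop = start + slice_len
        results.append(pearson_r(xs[start:stop], ys[start:stop]))
    return results


# Synthetic monthly series (NOT real search data): a shared upward trend
# plus wiggles that have nothing to do with each other.
months = range(144)  # 12 years of monthly points
series_a = [m + 10 * math.sin(m) for m in months]
series_b = [m + 10 * math.cos(1.7 * m) for m in months]

full_r = pearson_r(series_a, series_b)
yearly_r = correlation_by_slice(series_a, series_b, 12)

print(f"full range: {full_r:.2f}")
for year, r in enumerate(yearly_r, start=1):
    print(f"year {year:2d}: {r:+.2f}")
```

If the full-range figure is high but the per-year figures are scattered or much lower, the headline correlation is probably driven by the shared trend and not by any real relationship. The same helper can be pointed at other segments (regions, customer groups, product lines) by slicing the data along that dimension instead of by date.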
Variety Is the Spice of Analytics
Another trap that is easy to fall into is using the same data source over and over to get to your answers. Try something new! If you are using survey data to measure the success of your new product launch, try online reviews or Facebook and Twitter chatter instead. Get adventurous! Start blending and mixing data sources and see if the correlation holds across them. If it does, you are one step closer to an impactful insight.
The idea of “killing your darlings” is taught to writers but rarely taught to analysts. In analytics, as in writing, letting go of a prized creation can seem like an insurmountable challenge. To overcome this powerful urge to shelter our favorite ideas, use the steps above and challenge your findings. Think of each finding, each correlation and each insight as an opportunity to grow your testing skills. By getting out of your comfort zone, varying your sources and tests and identifying (and whenever necessary, killing) your darlings, you can feel good about the final findings you pass on to the end user, certain that these gems you have mined out of the data will be able to lead to impactful change.