My good friend Q just wrote a post for the New York Times. It was a profile of another company — not mine, boo — but it captured a lot of the interest in big data. Was a great read.
There was a comment on his Facebook post about “spurious correlations” and how big data companies never have a good answer to how to avoid spurious correlations. This is half of a good question.
So I sit at the edge of my bed
I strum my guitar and I sing an outlaw love song
--Social Distortion, "Story of my life"
Now pay attention, my seven readers, because this will be an ongoing theme. Ready? Here we go.
Given enough cases, almost any difference will be “statistically significant”. But, honestly, who cares? Even if something is likely not random (that is, it is significantly different), if that something has a small enough impact, it’s probably not worth paying attention to.
This is a BIG DEAL: Instead of looking for significance, you should be asking for “effect size”. A small, but significant, effect size doesn’t tell you much about behavior, it just tells you about the vagaries of statistics.
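To make that concrete, here is a minimal sketch in Python using made-up, simulated data. The group means, the spread, and the sample size are all assumptions picked for illustration. It uses a large-sample z-test for the p-value and Cohen's d as the effect size, which is one common choice among several.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def compare_groups(a, b):
    """Return (two-sided p-value, Cohen's d) for two independent samples."""
    mean_a, mean_b = mean(a), mean(b)
    var_a, var_b = sample_variance(a), sample_variance(b)
    # Large-sample z-test on the difference of means.
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_b - mean_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Cohen's d: the difference in means, in units of the pooled standard deviation.
    pooled_sd = math.sqrt(
        ((len(a) - 1) * var_a + (len(b) - 1) * var_b) / (len(a) + len(b) - 2)
    )
    return p_value, (mean_b - mean_a) / pooled_sd

# Hypothetical data: two groups whose true means differ by 0.5 on a scale
# with a standard deviation of 15, and 200,000 cases in each group.
rng = random.Random(0)
n = 200_000
group_a = [rng.gauss(100.0, 15.0) for _ in range(n)]
group_b = [rng.gauss(100.5, 15.0) for _ in range(n)]

p_value, d = compare_groups(group_a, group_b)
print(f"p = {p_value:.2g}, Cohen's d = {d:.3f}")
```

With this many cases the p-value comes out tiny, while the effect size stays a few hundredths of a standard deviation. Significant, sure. Interesting, probably not.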
Most statistics that people use to measure “significant differences” place a number of assumptions on the underlying data. The two most common (and egregious) assumptions are normality and linearity. The normality assumption requires that the underlying data fit a certain distribution, and the tests built on it typically also assume that there are no hidden relationships between data points. First, in interesting domains (which is nearly all of them), almost nothing is normally distributed. Oops, that stat is wrong. Second, many factors in the real world are related in hidden ways: say there is a relationship between educational attainment and the *mother’s* educational attainment, but not the father’s. Oops, if that’s true, that stat is even more wrong.
The problem is that every statistical tool will spit out a significance number whether or not its assumptions hold. And software built on top of those tools often repeats those numbers without checking whether they are correct.
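One way to get a significance number that does not lean on normality is a permutation test. The sketch below is my illustration, not something from the Times piece or the Facebook thread. The function name `permutation_p_value`, the skewed (exponential) sample data, and the number of shuffles are all assumptions.

```python
import random

def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(b) / len(b) - sum(a) / len(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffle the group labels and see how big a gap chance alone produces.
        rng.shuffle(pooled)
        fake_a, fake_b = pooled[:len(a)], pooled[len(a):]
        gap = abs(sum(fake_b) / len(fake_b) - sum(fake_a) / len(fake_a))
        if gap >= observed:
            hits += 1
    # Add-one correction so the p-value is never reported as exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical, heavily skewed data (exponential, so definitely not normal).
rng = random.Random(1)
skewed_a = [rng.expovariate(1.0) for _ in range(300)]
skewed_b = [rng.expovariate(1.0) for _ in range(300)]  # same distribution as a
shifted_b = [x + 1.0 for x in skewed_b]                # same shape, shifted up by 1

p_same = permutation_p_value(skewed_a, skewed_b)
p_shifted = permutation_p_value(skewed_a, shifted_b)
print(f"same distribution: p = {p_same:.3f}; shifted: p = {p_shifted:.4f}")
```

This drops the normality assumption, but not all of them: shuffling labels still assumes the cases are exchangeable, so hidden relationships between cases can still make the number wrong.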
There are loads of problems with spurious correlations. But they are mostly subsumed by the problems of effect size and statistical correctness.
The entire good question? “What’s the effect size, and how did you compute significance?” Ask the question at your next dinner party policy discussion. You’ll either inspire awe or make a bunch of people think you’re a horrible geek.
// Side note: I am the latter. No doubt. //