Anum Basir writes for Analytics Weekly:
“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.
Every executive should read this article about "big data". I believe it fits right into my narrative that, even with data, companies need art along with the science to have true insight, as I wrote here. The article is a long read, but it details the promises of "big data" along with the pitfalls of trusting the results without the proper insight and testing.
As with so many buzzwords, “big data” is a vague term, often thrown around by people with something to sell.
I believe in more data, not in the term "big data". When people are trying to sell "big data" to corporations, are they really helping?
But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast.
In some circumstances they might be, but what Basir discusses in this article is the idea of "found data". This is data that already exists inside the company, or data that is simply not being tracked and analyzed. As I wrote here, companies are sitting on a treasure trove of data that they are not yet using optimally. Adding "big data" may send the corporation down a path it is not ready for. Always search for the next best data that will answer the questions that need answering.
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”
Basir goes on in the article to punch holes in these four claims. "Big data" is very promising, but for most companies it is a destination. When something is a destination, there is a path that needs to be taken to get there. The path may change and there are detours along the way, but most companies can't just jump all in on "big data" or "found data". Companies must build an analytics culture, live in their data, and use that data to make decisions alongside the business acumen they have built up over many years.
As Basir points out in the article, the problems with data do not go away with more of it; they just get bigger.
Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim: Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.
The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.
But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.
The correlation-versus-causation problem does not go away with "big data". Pairing the results with insight is key to successful analytics. We are all familiar with the story that ice cream sales and shark bites are strongly correlated, so does selling more ice cream cause shark bites? That is obviously silly: the two move together because both ice cream sales and swimming in the ocean increase in the summertime.
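To make the ice cream and shark bite story concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data is synthetic (not real sales or shark figures), and the variable names and coefficients are made up. It simulates daily temperature driving both series, then compares the raw correlation with the correlation that remains once temperature is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperature: the hidden common cause.
temperature = [rng.uniform(0, 35) for _ in range(2000)]

# Both series depend on temperature plus independent noise.
# Neither one depends on the other. (Synthetic values, so they can be
# fractional or negative; only the relationship matters here.)
ice_cream_sales = [10 * t + rng.gauss(0, 30) for t in temperature]
shark_bites = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson(ice_cream_sales, shark_bites)
controlled = pearson(residuals(ice_cream_sales, temperature),
                     residuals(shark_bites, temperature))

print(f"raw correlation:                {raw:.2f}")
print(f"after removing temperature:     {controlled:.2f}")
```

The raw correlation comes out strong even though the simulation contains no link between ice cream and sharks, and it shrinks toward zero once temperature is removed. The catch is that you have to know to look at temperature in the first place, and that knowledge comes from understanding the subject, not from the data alone.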
But the example brings up a crucial point: do not trust the output of the data without using your own knowledge of the subject as a barometer. Anyone can see that shark bites and ice cream have nothing to do with each other, but spurious findings in big data can be much harder to detect. What looks like a reasonable explanation from a model a data scientist created may actually ruin a business if the data scientist had no knowledge of the subject matter. When relying solely on data, all the great human knowledge about the business is thrown away. This is art and science. Treat it as such. Get the artists into the room with the scientists and find the best answer, not the cheapest and easiest one. Actionable analytics is hard; don't underestimate the complexity of the problem.