How Big Data Creates False Confidence

“The general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry… But there’s a problem: It’s tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn’t be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus — and the reasons why should give us pause about any research that blindly trusts big data.”

For example, Google’s database of scanned books represents 4% of all books ever published, but in this data set, “The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria.” And the name Lanny appears to be one of the most common in early-20th-century fiction, solely because Upton Sinclair published 11 different novels about a character named Lanny Budd.
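The Lanny effect is easy to reproduce on a toy scale. The sketch below is only an illustration, not the article's method: apart from the 11 Sinclair novels mentioned above, the authors, names, and counts are invented. It compares counting a name once per book with counting it once per author.

```python
from collections import Counter

# Hypothetical toy corpus: one (author, protagonist name) pair per book.
# Only the 11 Lanny Budd novels come from the text; the rest is made up.
books = [("Upton Sinclair", "Lanny")] * 11 + [
    ("Author A", "John"),
    ("Author B", "John"),
    ("Author C", "John"),
    ("Author D", "Mary"),
    ("Author E", "Mary"),
]


def count_by_book(records):
    """Count each name once per book, so prolific authors weigh more."""
    return Counter(name for _, name in records)


def count_by_author(records):
    """Count each name once per distinct (author, name) pair."""
    return Counter(name for _, name in set(records))


raw = count_by_book(books)
dedup = count_by_author(books)

print("Per book:  ", raw.most_common())
print("Per author:", dedup.most_common())
```

Counted per book, Lanny leads with 11 appearances. Counted per author, it drops to a single appearance and John comes out on top. Neither count is the correct one in general; the point is that the ranking depends on a weighting choice that the raw totals hide.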

The problem seems to be skewed data and misinterpretation. (The article points to the failure of Google Flu Trends, which it turns out “was largely predicting winter”.) The article’s conclusion? “Rather than succumb to ‘big data hubris,’ the rest of us would do well to keep our sceptic hats on — even when someone points to billions of words.”
