New vocab: petabyte. Abbreviation is PB. I thought a terabyte was immense, but a petabyte is 1,024 terabytes. This tacks onto a scale that starts with a bit; 8 bits make a byte, and so on. Here’s a great explanation from Wikipedia, and you can see that there are named orders of magnitude MUCH larger than the petabyte.
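To make the scale concrete, here’s a minimal sketch (my own illustration, not from the Wikipedia page) that climbs the ladder using the binary convention, where each named step is 1,024 times the one before it:

```python
# Byte-scale units under the binary (1,024-based) convention.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]

def bytes_in(unit):
    """Number of bytes in the named unit (1 kilobyte = 1,024 bytes, etc.)."""
    return 1024 ** UNITS.index(unit)

print(bytes_in("terabyte"))   # 1,099,511,627,776 bytes
print(bytes_in("petabyte"))   # 1,125,899,906,842,624 bytes
print(bytes_in("petabyte") // bytes_in("terabyte"))  # 1,024 terabytes per petabyte
```

(Storage vendors often use the decimal convention, 1 PB = 10^15 bytes; the 1K-terabytes figure above is the binary one.)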
Access to petabytes of data means analysis must take a different form to accommodate the sheer quantity of information. As Chris Anderson wrote in Wired earlier this summer:
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.
So, here’s how this ramifies into science and the realm of academia:
The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
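That caution can be made concrete: scan enough variables and “strong” correlations turn up by chance alone, which is exactly why models matter when the data are small. A minimal sketch (pure Python, synthetic noise; my own illustration, not anything from Anderson’s article):

```python
import random

random.seed(1)

# 1,000 columns of pure noise, 20 observations each. With this many
# variables, some pair will correlate strongly by coincidence.
n_obs, n_cols = 20, 1000
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_cols)]

def corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Strongest correlation between column 0 and any other column --
# a "pattern" found by statistics in data that contain none.
best = max(abs(corr(data[0], data[j])) for j in range(1, n_cols))
print(round(best, 2))  # a sizable correlation, from pure noise
```

With small samples like these, the correlation is a coincidence, not a mechanism; Anderson’s claim is that at petabyte scale the sheer volume of observations changes that calculus.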
But faced with massive data, this approach to science—hypothesize, model, test—is becoming obsolete.
So, this is the paradigm shift Anderson envisions:
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
In other words, we used to rely on models because data were lacking or we only had samples. Now, Anderson says, we have something much closer to the entire universe of data, so there is no longer any need for a model to guesstimate the gaps.
I confess, I’m still trying to figure out how I might operationalize this (or archaeologists who are smarter than me might)…. Ideas?
29 August 2008 at 9:18 am
Not that____is interesting – may I continue to believe that correlation is not causation until I understand?
29 August 2008 at 9:19 am
I mean THAT is interesting.
I am going to think about it all. It’s a lot.
29 August 2008 at 9:30 am
I think maybe the kind of data Anderson’s talking about, the kind voluminous enough to constitute petabytes, may work the way he says, but the less voluminous, small samples that I use would not. Still, it’s interesting to contemplate the hypothesis that we’ve reached a time when some types of data have exceeded the must-be-modeled threshold.