Thursday, July 3, 2014

More Data Doesn't Mean More Interesting Data

David Beer, at Adaptive Computing, writes:
One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t.
He describes the problem of predicting what online video a user is going to watch next, and how an analysis can quickly run up thousands of possible 'next steps' to evaluate.
These are then compared with all of the other empirical data from all other customers to determine the likelihood that you might also want to watch the sequel, other work by the director, other work from the stars in the movie, things from the same genre, etc. As I perform these calculations, how much data should be ignored? How many people aren’t using the multiple user profiles and therefore don’t represent what one person’s interests might be? How many data points aren’t related to other data points and therefore shouldn’t be evaluated as a valid permutation the same as another point?
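To make Beer's questions concrete, here is a minimal sketch of the kind of pre-filtering involved. The log format, function name, and `min_support` threshold are all my assumptions, not anything from the post; the point is simply that most candidate 'next steps' can be discarded before any expensive comparison happens.

```python
from collections import Counter

def candidate_next_videos(watch_logs, current_video, min_support=20):
    """Count which videos users watched immediately after `current_video`,
    then drop candidates with too little support to be signal.

    watch_logs: dict of user_id -> ordered list of video ids
    (a hypothetical format; real logs would carry timestamps,
    profiles, and much more).
    """
    follows = Counter()
    for history in watch_logs.values():
        for prev, nxt in zip(history, history[1:]):
            if prev == current_video:
                follows[nxt] += 1
    # Ignore the long tail of one-off transitions (the noise Beer
    # describes) and keep only well-supported candidates.
    return {vid: n for vid, n in follows.items() if n >= min_support}
```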
Questions like Beer's are where an experienced scientist provides the most value on data problems at this scale.  Such a person has at least several years of work experience in a hypothesis-driven research environment and is able to solve problems using incomplete data.  They probably have a PhD to go with that quantitative experience.

The first point, working in a hypothesis-driven environment, means that the person should be able to devise a strategy to prove or disprove a hypothesis (I hypothesize that this customer will watch video Y after video X) and figure out how to do so efficiently, without getting stuck in the weeds of the irrelevant data Beer describes; one way to make that concrete is sketched below.  Unfortunately, it takes some skill to determine in an interview whether a person can actually do this, especially when there are differences in background between yourself and the interviewee.
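As a toy illustration of such a strategy (the data format and the simple lift measure are my assumptions, not Beer's or anyone's actual method), the hypothesis "this customer will watch Y after X" can be reduced to a single number and checked against the base rate:

```python
def lift(watch_logs, video_x, video_y):
    """Estimate how much watching X raises the chance of watching Y next,
    relative to the base rate of Y following any video."""
    after_x = after_x_and_y = transitions = y_total = 0
    for history in watch_logs.values():
        for prev, nxt in zip(history, history[1:]):
            transitions += 1
            y_total += (nxt == video_y)
            if prev == video_x:
                after_x += 1
                after_x_and_y += (nxt == video_y)
    if after_x == 0 or y_total == 0:
        return None  # not enough data to evaluate the hypothesis
    p_y_given_x = after_x_and_y / after_x
    p_y = y_total / transitions
    return p_y_given_x / p_y  # lift > 1 supports the hypothesis
```

A lift well above 1 supports the hypothesis; a lift near 1 says X tells you nothing about Y, and the candidate can be pruned without any deeper analysis.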

The second point, being able to use incomplete data, is something that seems to come from experience.  Most people trained in research fields start off trying to collect as much data as possible, and won't make a decision until 'more data is collected'.  It's easy to get stuck in a data collection rut, but eventually most people realize that it's actually OK to come to a conclusion before seeing the whole picture.
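One informal way to pin down 'enough data' is a stopping rule: conclude as soon as the uncertainty around your estimate is small enough to act on, rather than waiting for everything. A sketch, where the normal-approximation interval and the threshold values are my assumptions:

```python
import math

def can_decide(successes, trials, z=1.96, margin=0.05):
    """Return True once the (normal-approximation) confidence interval
    around the estimated rate is tight enough to act on."""
    if trials == 0:
        return False
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)  # 95% CI half-width
    return half_width <= margin
```

For example, 180 successes in 300 trials gives an estimate of 0.60 with a half-width of about 0.055, so `can_decide(180, 300)` is still False; roughly 70 more trials at the same rate would tip it.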

Collecting a lot of extra data costs time and resources, and demands your attention until that elusive point of having 'enough data' is reached.  Sometimes that data is worth it, but often it's not: it sits idle because no one has time to do anything with it, and it risks becoming stale.  Unless it's actually your job to do so, be careful of making data for the sake of making data.
 
ASIDE: One of the neatest things I find about the customer analytics field (as compared with genomics or computational biology) is that the data is generated by the study population itself, essentially for free.