Big Data

“Big data” is a relative term: for at least the past 25-30 years, with the advances in digital technologies, the data collected at any point would look “big” relative to, say, a year earlier. However, we are currently at a point where the rate of growth looks far greater than ever before. According to Wikipedia, “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” I believe that all along, we have been keen to collect as much data as possible and worry about processing and analysing it later. Collecting data has gotten so easy these days that no one wants to get rid of any of it, in the hope that the data will be useful in some fashion, some day!

On the one hand, the availability of such vast quantities of data is very exciting for researchers. On the other, there are many issues about analyzing and processing them that need to be sorted out. And then, we need to worry about a whole slew of associated secondary issues such as the misuse of data, long-term storage of the data and the impact on society of misinterpreting the data.

Data collection is done in a couple of different ways. In the first, one defines a problem or sets a goal and then works towards collecting the data to help solve the problem or achieve the goal. In the second, one analyzes data that has already been collected to help answer the questions. The difference is that in the second case, you may be using data that was not collected with a specific goal in mind, and therefore the conclusions have to be vetted carefully. In the first case, typically, you think through the problem and set up sampling or data collection methods so that you have controls against which to test an assumption. In the second, you may not have clear controls, for example, and therefore any conclusions that you derive will need extra caution.

I strongly encourage you to read the NY Times Op-Ed titled “Eight (No, Nine!) Problems With Big Data”. It provides specific examples of the dangers of the big data hype. For example, it is so easy to find spurious correlations when you have so much data to analyze. The data shows a strong correlation between the United States murder rate and the market share of Internet Explorer between 2006 and 2011. This is a funny example, but, given the quality of IE, I can be convinced that this is not spurious after all! Just kidding. The other issue is the failure of Google Flu Trends in recent years. The basic point is that the mere availability of vast amounts of data does not guarantee much, unless careful thought goes into harnessing it properly. The hype almost always leads to disappointment.
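To make the point about spurious correlations concrete, here is a minimal sketch in Python. It is my own illustration, not something taken from the Op-Ed: it generates purely random series of six points each (six is chosen only to mirror six yearly values such as 2006 to 2011) and reports the strongest correlation that turns up by chance. The function names, the seed and the series counts are all assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Largest |r| between one random target series and n_series random,
    unrelated candidate series (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    # The "target" plays the role of the quantity we want to explain.
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        # Every candidate is pure noise, unrelated to the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# Six points per series; compare a handful of candidates with thousands.
few = strongest_chance_correlation(n_series=5, n_points=6)
many = strongest_chance_correlation(n_series=5000, n_points=6)
print(f"strongest |r| among 5 random series:    {few:.2f}")
print(f"strongest |r| among 5000 random series: {many:.2f}")
```

None of these series has anything to do with the others, yet once there are thousands of candidates to choose from, at least one of them is all but certain to track the target very closely. That is the trap: the more data you can search through, the easier it is to find a pattern that means nothing.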

We have all experienced the pain of data cleanup even with the small datasets we have to deal with. In addition, collecting data without a plan will result in either the wrong data being collected or data being collected with less accuracy than desired, possibly amplifying the noise in the data. Issues such as these, along with sample size and spurious correlations, are the major reasons why the traditional analysis methods cannot be directly applied to big data. Scientists are working hard to develop new methods to solve these issues; for a review of some of the methods, you can read the paper “Challenges of Big Data Analysis”.

In “Why Big Data is a Big Deal”, you will read about some real-life examples that illustrate the promise that Big Data holds, when care has been exercised.

Now, on to some real-life examples. As we all know, smartphones and other mobile devices are sending so much data about us to so many places that we may not even know about them all. Of course, the phone companies have this data, and a few good things have come of it, such as locating lost phones or even the last known location of a lost person. For the most part, though, people fear that they don’t even know how this data is being used. What if it is used in the wrong way and spurious correlations profile you as a terrorist? Once you have been labeled that way, it is very hard to get the label removed, so one hopes that those who are using these data exercise enormous care and are knowledgeable in data analysis techniques.

The likes of Netflix, Google and Amazon are masters at Big Data analytics. The other day I sent a couple of possible dates for travel to a friend’s house from my Gmail account and within seconds, Google Now on my phone started showing me the fares to Charlotte from Boston. Brilliant, I thought. My friend thinks this is too spooky. These are really applications of big data analytics, in that the learning systems behind them need enormous training datasets to do this correctly. Netflix and its recommendations based on your viewing habits or even profiling (based on who you are and comparing you to other similar people) is another marvelous addition. However, I wish they would pay attention to detail. They might want to send me my recommendations on a Thursday or Friday, based on the fact that I watch things only on weekends!

I generally like the recommendations from Amazon because they are so well done. However, Amazon needs to understand that if I just purchased a pair of shoes, I am unlikely to replace them in 3 weeks, so please, don’t do some stupid things just because you have the data!

Big Data does not always have to be about numerical data. Whether it is scholarly texts or enormous amounts of social media text, there is plenty there for analysis. And while new systems are being developed to analyze them, they have their own set of issues.

The discussion about Big Data can itself be big, but I will stop here.

1 Comment on Big Data

  1. Layli Maparyan
    September 17, 2014 at 12:46 pm

    Hi, Ravi: I really enjoyed this blog, because I am still trying to wrap my head around this nebulous concept known as “big data.” The level of public and scholarly discourse makes it obvious that “big data” is important, but, at times, it doesn’t seem like people are talking about the same thing. Your blog is clarifying in this regard. Thank you!
