Apr
2013
Big Data is Big Deal
Many of us are saddened by the Boston Marathon bombings and are relieved that the ordeal has come to an end. Or has it? I think each of us will take our own time to reflect on the events, digest both the reliable information and the misinformation coming at us from all directions, and draw our own conclusions. As I wrote in my last post, various technologies played important roles in identifying the suspects and eventually capturing one of them. The events brought several important things to light: the explosion of technologies, law enforcement's reliance on distributed sources (video recordings from sources other than law enforcement), social media and crowd-searching (crowdsourced searching), and thermal imaging.
Frankly, what got lost in all of these discussions is that every one of these items is far more complicated than the positive aspects that helped us in the end. And most importantly, what led to the surviving suspect was an actual curious human being, not the technology. Quite obviously, there were pitfalls every step of the way: privacy, security, misuse of captured information, and the dangers of subjectivity in crowdsourcing the search, which carries a high probability of implicating the wrong people. And the massive data that was helpful in cases like this and others is the “Big Data”.
For excellent information on Big Data, I strongly recommend Kara Miller’s very accessible InnovationHub piece, “How Big Data Changes Lives”. Also, please read an excellent blog post on the short history of Big Data and this piece from IBM.
In some sense, the growth in data storage comes from three sources: large data sets collected primarily by researchers, either through experiments or through long computer simulations; data collected by commercial organizations for profit generation; and the massive explosion in data stored by individuals, made possible by highly affordable large storage devices.
Remember the days when we boasted about buying that 100 GB drive on Amazon for a great price? Now you can buy 1 Terabyte of storage for just under $100 from Amazon. When you can buy more for less, why not? This is what we do with our own computing devices. Most of us don’t actually need the computing power of the machines we own and use, but we are forced to buy them because that is the only game in town, whether we need it or not. In the same way, we purchase these humongous data storage devices and continue to store everything there: documents, pictures, and music. We never bother to delete them because, one day, we may actually use them! (Good luck trying to find them that one day.)
“Big Data” is also a subjective term. For individuals, the growth from megabytes to terabytes in such a short time period is a “Big Data” explosion. However, for scientists who are collecting astronomical images at much greater resolutions, or genomic data that are now collected and analyzed in minutes rather than years, the growth has been from terabytes to petabytes or even exabytes. FYI: 1 Terabyte is one trillion bytes, 1 Petabyte is 1,000 Terabytes, and 1 Exabyte is a million Terabytes.
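For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the decimal (powers of 1000) definitions used above; the `UNITS` table and the `convert` helper are names made up for this illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: powers of 1000, matching "1 Terabyte is one trillion bytes".
UNITS = {
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity from one storage unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "PB", "TB"))  # 1000.0
print(convert(1, "EB", "TB"))  # 1000000.0
```

Note that some operating systems and vendors report sizes in powers of 1024 instead, so the numbers you see on your own machine may differ slightly.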
The issue that everyone is facing today is how to store, retrieve and analyze these humongous data in reasonable time. Our original computer systems were never designed with this kind of explosive growth in mind, and our databases, as good as they are, were not meant to handle terabytes of data. Incremental changes have been made to accommodate petabytes today, but some radical rethinking is necessary. It should not come as a surprise that Google is at the forefront of some of this with a service they call BigQuery.
So, having “Big Data” may sound great, but the bottom line is we need better systems to access and analyze it quickly. It is not enough to be able to store and retrieve fast; we also need better and faster algorithms to deal with this mountain of data. This is where Quantum Computing and Genetic Computing, and the algorithms being developed there, are critical. Of course, these remain theoretical at this point. But when they become real, Grover’s algorithm tells us that a search through N unsorted items can be done in roughly the square root of N steps, which is much more efficient than the most efficient “classical” algorithm available today, which needs on the order of N steps! Genetic algorithms, of course, are already being exploited in Artificial Intelligence.
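To get a feel for what that square-root speedup means, here is a rough back-of-the-envelope sketch in Python. It does not simulate a quantum computer; it only compares step counts, assuming a single matching item, about N/2 lookups on average for a classical scan, and the standard estimate of about (π/4)·√N Grover iterations. The function names are made up for this illustration.

```python
import math


def classical_queries(n):
    """Average lookups for a classical scan over n unsorted items (about n/2)."""
    return n / 2


def grover_iterations(n):
    """Approximate Grover iterations to find one marked item among n items."""
    return math.floor(math.pi / 4 * math.sqrt(n))


# Compare the two step counts for a million and a trillion items.
for n in (10**6, 10**12):
    print(n, classical_queries(n), grover_iterations(n))
```

For a trillion items, that is hundreds of billions of classical lookups against fewer than a million quantum iterations, which is why people get excited about this.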
Big Data used by our faculty (whatever our definition of Big Data may be) needs to be preserved and made available to other researchers if the data was produced under federally funded grants. See the NSF Data Management Plan for details. Of course, it uses loaded terms such as “reasonable time” and “incremental cost”. A whole bunch of incremental costs suddenly adds up to a big cost. In other words, we are under obligation to maintain these data and make them available, and it does not come cheap. When we are severely resource constrained, how do we do it?
The other issue with Big Data is how much of it is used, and how often. Storage companies have already thought about this and developed algorithms that move data from primary to secondary to tertiary and eventually to offline storage based on the staleness of the data. But who is worrying about the authenticity and integrity of these data? In other words, the user may assume that regardless of where the data is stored, it can be accessed 5 or 10 years from now. How confident are we that the data will stay intact for that long? Should we not have a plan to scrub some of these data? If so, who makes those decisions?
There is a real-life parallel in libraries: weeding collections. What criteria should be used, who makes those decisions, and so on.
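To make these two ideas a little more concrete, here is a minimal sketch in Python. It is not how any particular storage vendor does it. The tier names follow the paragraph above, but the age thresholds and the function names are assumptions made up for this illustration. The first part picks a tier based on how long ago the data was last accessed; the second records a SHA-256 checksum so that, years later, we can tell whether the data is still intact.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical thresholds: time since last access before data moves down a tier.
TIER_THRESHOLDS = [
    (timedelta(days=30), "primary"),
    (timedelta(days=365), "secondary"),
    (timedelta(days=5 * 365), "tertiary"),
]


def choose_tier(last_accessed, now):
    """Return the storage tier for data last touched at `last_accessed`."""
    age = now - last_accessed
    for limit, tier in TIER_THRESHOLDS:
        if age <= limit:
            return tier
    return "offline"  # older than every threshold


def checksum(data):
    """Return the SHA-256 digest of `data` (bytes) as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data, recorded_checksum):
    """Check the data against the checksum recorded when it was stored."""
    return checksum(data) == recorded_checksum


now = datetime(2013, 4, 1)
print(choose_tier(datetime(2013, 3, 20), now))  # primary
print(choose_tier(datetime(2005, 1, 1), now))   # offline

recorded = checksum(b"research data")
print(is_intact(b"research data", recorded))    # True
print(is_intact(b"research dat4", recorded))    # False
```

A checksum only tells us that something changed, not how to repair it, so it still has to be paired with extra copies and with someone deciding how often to check.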
And then comes all the negativity around big data: how insurance companies can datamine electronic medical records to set rates or even deny coverage, how surveillance devices of all kinds are out there without any of us knowing, with all the potential for misuse, how some of the “bad guys” can tap into surveillance devices, and how cell phone providers can use tracking data in ways we cannot even begin to think about. Unfortunately, many of these practices develop much faster than the policies governing them, and they also happen to be a one-way street.
There is no going back to small data! It is all Big Data, and soon Huge Data. It is a Big Deal, so learn to live with it. When you walk around in big cities, watch what you do. And don’t be surprised if you start getting emails from a marketer based on what you are wearing (or better yet, what they think you should be wearing)!