We know there is lots of it. We are creating data wherever we go. Data about you used to live inside traditional businesses – buying things in shops, doing something in a bank. Some of us have been on the web for twenty years, and the data now lives in plenty of other places. The explosion of mass adoption, together with social and real-time data and system data from devices, helps build the mountain.
90% of the world’s data has been created in the last two years (Source: IBM 2012, Watson).
Some people talk about data in terms of structured and unstructured data. I tend not to worry about this distinction – unstructured data such as a video clip usually contains metadata that is useful. What’s more important is the sheer variety of data structures associated with the data – how it is laid out in its ASCII and binary form, for example. This is something immense to get your head around. Add in the semantic description of the data – both what you mean by the data, and your interpretation of what the creator might have meant by it. (You might even be lucky and get a formal published semantic specification on which all are agreed.)
Think about just some of the data trails (and the interest factor) you leave. Web server logs (telco / ISP stats); new customer registration on a web site, or via a newspaper coupon that someone keys in somewhere (personal data); credit card numbers (card providers’ transaction histories; numbers littered across e-commerce web sites protected by fragile passwords); tweets (a low signal-to-noise ratio, but some insight into what is on your mind); Facebook/Google+ et al posts (living your life online, the ad man’s dream); Foursquare, TripAdvisor, holiday bookings (where you’ve been, what you like/dislike)… and it goes on and on and on.
Compound this with all of it being done under several similar, sometimes hidden, identities. Compound this further with contexts in time. Standalone pieces of data such as a tweet are often meaningless, but in the context of world news events, for example, they may give a different insight into your views and opinions.
Data is the new landfill.
There are volumes of this stuff, littered across the web and inside enterprises. The last private bastion of data is your personal documents and files that you have not shared on the web, living on your home disk drives (when did you last back up, by the way?). The semantics of this huge variety of data structures may or may not be specified somewhere.
The question is, are you going to recycle this data? Why bother, if it is just a sifting job that might be too huge? Because in business that data is very likely a competitive advantage. The revenues Google and Facebook earn from targeted adverts are simple examples.
There might be a cost to the data (capturing it by people and machines, storing it, and maintaining it) but there is a benefit in the insight that the information holds. We hear the term “Big Data” – the use of technology and products such as Hadoop that can help begin the landfill sifting and recycling exercise.
However, don’t ignore the semantics and matching problems. You need to begin that sifting exercise with some insight goal in mind, which means creating rules and a strategy for combining the data and their associated schemas, so that you can recycle the information and find the insight you didn’t know existed in the depths of the data.
Think about tackling the landfill by combining strategies to address at least: collection of the data; mapping data schemas to a target (but extensible) enterprise information model; raw analysis of data (such as with MapReduce); and aggregations and visualisations to find the ‘islands’ or ‘lumpy’ areas where there might be something worth digging deeper into.
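To make those steps a little more concrete, here is a minimal sketch in Python of how they might fit together on a toy scale. Everything in it is an assumption for illustration: the source names (`web_log`, `crm`), the field names, the target model with `customer_id` and `topic`, and the "1.5 times the mean" rule for spotting a lumpy area are all hypothetical. It uses plain in-memory map and reduce steps rather than Hadoop, so treat it as a picture of the idea, not a recipe.

```python
from collections import Counter
from functools import reduce

# Hypothetical mappings from each source schema to a shared target model.
SCHEMA_MAPS = {
    "web_log": {"uid": "customer_id", "page": "topic"},
    "crm": {"CustomerRef": "customer_id", "Interest": "topic"},
}


def to_target(source, record):
    """Rename known fields to the target model; park unknown ones under 'extra'."""
    mapping = SCHEMA_MAPS[source]
    target = {"source": source, "extra": {}}
    for field, value in record.items():
        if field in mapping:
            target[mapping[field]] = value
        else:
            # Keeping unmapped fields is what makes the model extensible later.
            target["extra"][field] = value
    return target


def map_phase(record):
    """Emit one (key, 1) pair per record, keyed on topic."""
    return (record.get("topic", "unknown"), 1)


def reduce_phase(counts, pair):
    """Fold a (key, n) pair into the running totals."""
    key, n = pair
    counts[key] += n
    return counts


def lumpy(counts, factor=1.5):
    """Return keys whose count exceeds `factor` times the mean count."""
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(k for k, v in counts.items() if v > factor * mean)


# Collected raw data: (source name, record in that source's own schema).
RAW = [
    ("web_log", {"uid": "u1", "page": "holidays", "agent": "x"}),
    ("web_log", {"uid": "u2", "page": "holidays"}),
    ("crm", {"CustomerRef": "u1", "Interest": "holidays"}),
    ("crm", {"CustomerRef": "u3", "Interest": "banking"}),
    ("web_log", {"uid": "u3", "page": "cars"}),
]

records = [to_target(source, rec) for source, rec in RAW]
counts = reduce(reduce_phase, map(map_phase, records), Counter())

if __name__ == "__main__":
    print(dict(counts))
    print(lumpy(counts))
```

The point of the sketch is the ordering: the schema mapping happens before any counting, so records from different sources land on the same keys. Skip that step and the aggregation simply counts `page` and `Interest` as unrelated things.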
You can’t eat this beast in one go straight away. But if you are going to capitalise on the insight that is in the landfill then you need to get started on the sifting now. The data mountain is going to keep growing.