Let’s begin by understanding the term ‘unstructured data’ and how it differs from other forms of data.
Data that carries metadata (data about data) is generally classified as structured or semi-structured. Relational databases, which define a schema for their tables, and simple tables with named columns are examples of structured data. XML files, whose tags describe their content, are usually treated as semi-structured.
Image credit: Dileep Govindaraju
Now consider data like blog content, comments, email messages, text documents (say, the legal policies of a company), audio files, video files or images. Together these are commonly estimated to make up about 80 to 90% of all data available for analysis. These forms of data do not follow any specific structure, nor do they carry information that describes their content. They are all classified as unstructured data.
Given these proportions, old-school database analytics methods that work only on structured data limit access to a small fraction (by the estimate quoted here, just 0.5%) of the information available for analysis. With technologies like Hadoop growing fast, the focus is shifting towards tapping information from this unexplored, chaotic realm of unstructured data that is available in huge volumes.
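To make the difference concrete, here is a minimal Python sketch. The sample record, the sample message and the "paid <number>" pattern are all made up for illustration. With the tagged record, the tags tell us which value is which. With the free text, we have to guess how the same fact might be phrased.

```python
import re
import xml.etree.ElementTree as ET

# Tagged data: the tags act as metadata that says what each value means.
# (Hypothetical sample record.)
structured = "<order><customer>Asha</customer><amount>250</amount></order>"
record = ET.fromstring(structured)
customer = record.findtext("customer")
amount = int(record.findtext("amount"))

# Unstructured data: the same facts buried in free text, with nothing labelling them.
# (Hypothetical sample message.)
unstructured = "Hi team, Asha called this morning and paid 250 for her order."

# Without metadata we fall back on an assumption about how amounts are phrased.
match = re.search(r"paid (\d+)", unstructured)
guessed_amount = int(match.group(1)) if match else None

print(customer, amount, guessed_amount)
```

If the message had said "settled her bill of 250" instead, the pattern would miss it. That fragility is what makes unstructured data harder to analyse.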
How is Hadoop suitable for analysing unstructured data?
Let’s take an example of unstructured data analysis:
Consider the video feed from the CCTV surveillance system of an enterprise. Currently, these videos are monitored by humans. Detecting incidents requires the person monitoring to watch multiple video feeds at once and to stay attentive all the time.
Assume this monitoring process needs to be automated. The amount of data that will be fed in is huge: a few terabytes every hour. Processing close to real time is required to detect incidents as they happen. This calls for a system that can store very large volumes of streaming data, process them at very high speed, and be flexible enough to run any customized algorithm on the data.
Hadoop offers these capabilities and can be used effectively in this scenario. However, in many cases of unstructured data, mainly video and audio analysis, designing optimized algorithms to extract useful information is still a challenging research problem. But given the pace of innovation in the data space, we are sure to see new and improved techniques and tools in the very near future. Watch this space, as the team at Jigsaw will keep you posted on new developments as and when they happen.
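In Hadoop, customized processing of this kind is typically expressed as map and reduce steps. The sketch below imitates that pattern in plain Python so that it runs without a cluster. It does not use any Hadoop API. The per-frame motion scores, the threshold and the function names are assumptions made for illustration, and a real system would first have to derive such scores from the video itself.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: one (camera_id, motion_score) pair per video frame.
frames = [("cam1", 0.1), ("cam2", 0.9), ("cam1", 0.8), ("cam2", 0.95), ("cam3", 0.2)]

THRESHOLD = 0.7  # assumed cut-off for flagging a frame as a possible incident


def map_frame(frame):
    """Map step: emit (camera_id, 1) for each frame that looks like an incident."""
    camera_id, score = frame
    return [(camera_id, 1)] if score >= THRESHOLD else []


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: total the flagged frames for one camera."""
    return key, sum(values)


mapped = chain.from_iterable(map_frame(f) for f in frames)
incidents = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(incidents)
```

Here cam2 ends up with two flagged frames, cam1 with one, and cam3 with none. On a cluster, the map step would run in parallel on the machines that hold each chunk of data, which is what lets this approach scale to large volumes.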
Interested in a career in Big Data? Check out Jigsaw Academy’s Big Data courses and see how you can get trained to become a Big Data specialist.