Introduction

  1. What is a Data Lake?
  2. Why Data Lakes?
  3. Data Lake Architecture
  4. Conceptual Data Lake Architecture
  5. Data Lake vs Data Warehouses
  6. Data Lake Implementations
  7. What are the benefits of Data Lakes?

1. What is a Data Lake?

The idea of the data lake took shape with the emergence of big data. Better storage technologies and algorithms for handling massive datasets let the data world expand exponentially, and with them came the school of thought that any and every piece of data can potentially be useful. A data lake is just what the name conveys: a huge centralized pool into which structured and unstructured data can be dumped, with appropriate pipelines in place to tap this data at massive scale.

The earlier practice of deciding up front which data is vital to the business, and how much of it is enough to make sense of, is discarded here. You are allowed to store your data as it is, without any predefined structure. This raw data is available to tools that can work on semi-structured and unstructured data as well as structured data, and that provide insights through dashboards, visualizations and reports, generated even in real time, helping management take timely decisions. This scale of data is what Machine Learning and Artificial Intelligence thrive on.

So when we say unstructured data, what does it encompass? It can be binary data such as videos and images. Semi-structured data includes tweets, comments, logs, blogs, JSON files and plain text. Structured data comes from sources like relational databases, neatly stacked in rows and columns. James Dixon, the chief technology officer at Pentaho, coined the term Data Lake when proposing it as an alternative to the data mart solution that was prevalent at the time, and still is in data warehouse based architectures.

2. Why Data Lakes?

What problem does a data lake solve?

Data lakes were conceptualized on the idea that any data may hold value worth mining. Data scientists and analysts use data lakes to mine and analyze large amounts of big data.

3. Data Lake Architecture

To truly understand data lake architecture, let's first look at how an EDW (enterprise data warehouse), the traditional data warehouse, is structured.

A high-level representation of a traditional EDW.

The idea behind a data warehouse is to extract, transform and load (ETL) structured data that is perceived to hold useful information for the business. At the end of the process is the presentation layer, with reporting, visualization and analysis tools that work on data from data marts or OLAP cubes to provide insights into the business domain.

When it comes to scaling up to Big Data levels, this infrastructure faces some stiff challenges.

Understanding the data first takes time: identifying the source system and its structure, determining cardinality, modelling the data to business requirements, cleansing it, performing exploratory data analysis and so on. That up-front effort is not a good fit for the volume of data that big data brings in, or for the speed at which it arrives.

With the exponential proliferation of data all around us, the definition of analyzable data is no longer the same, and an alternative model like the data lake makes much more sense.

So here is the conceptual architecture of a data lake.

4. Conceptual Data Lake Architecture

Data lake architecture at the conceptual level.

All structured and unstructured data is brought into a raw data store through an extract and load process; remember, there is no transformation. It is important to note that the storage here is based on commodity hardware, which is economical to procure at this scale. The analytical sandbox is a logical module used for exploratory data analysis and for applying data science tools to build various models, including predictive ones. All the data in this system is catalogued and curated under secure governance policies. Notice also the real-time processing engine, which handles data streams that must be processed and acted upon as they arrive.
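To make the extract and load step concrete, here is a minimal Python sketch of landing a file in a raw data store and cataloguing it. The folder layout (raw/<source>/<date>), the catalog.jsonl file and the ingest_raw function are assumptions made for illustration, not part of any particular data lake product; a real deployment would typically use object storage and a dedicated catalogue service.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(payload: bytes, source: str, name: str, lake_root: Path) -> dict:
    """Extract and load: store the bytes exactly as received, then catalogue them."""
    # Hypothetical layout: partition the raw zone by source system and ingestion date.
    now = datetime.now(timezone.utc)
    target_dir = lake_root / "raw" / source / now.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / name
    target.write_bytes(payload)  # no parsing, cleansing or reshaping happens here

    entry = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": now.isoformat(),
    }
    # Append-only catalogue: one JSON document per line.
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    lake = Path(tempfile.mkdtemp())
    print(ingest_raw(b'{"user": 1, "action": "login"}', "web", "event.json", lake))
    print(ingest_raw(b"\x89PNG...", "camera", "frame.png", lake))
```

Note that the JSON event and the binary image go through exactly the same code path. Nothing about their structure has to be known at load time; the catalogue entry is what keeps the stored data findable later.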

5. Data Lake vs Data Warehouses

Data lakes offer a response to how massively the data landscape has changed since the EDW was introduced.

Let’s go through the differences between the approaches.

  • The EDW way is to line up business requirements and model the data based on them, so understanding the data is an important initial step. Data lakes, on the other hand, do not require the data to be understood up front. Knowing the ecosystem and preparing for the various types of data is sufficient in this setup.
  • A lot more work is required before data can be loaded in a structured format into a data warehouse (DWH), whereas in a data lake the data is stored directly in its raw format in appropriate storage systems.
  • A DWH is highly optimised for structured relational data, whereas a data lake will take in all types of data.
  • A DWH lacks the flexibility that a data lake exhibits when requirements change.
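The differences above are often summarised as schema-on-write versus schema-on-read. The following Python sketch illustrates the contrast under simple assumptions: WAREHOUSE_SCHEMA and the three function names are hypothetical, and plain lists stand in for the actual storage systems.

```python
import json

# Hypothetical fixed schema for the warehouse-style path.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_schema_on_write(records: list[dict]) -> list[dict]:
    """Warehouse style: only records matching the predefined schema are loaded."""
    accepted = []
    for record in records:
        same_fields = set(record) == set(WAREHOUSE_SCHEMA)
        if same_fields and all(
            isinstance(record[key], expected)
            for key, expected in WAREHOUSE_SCHEMA.items()
        ):
            accepted.append(record)
    return accepted


def load_raw(records: list[dict]) -> list[str]:
    """Lake style: keep every record as a raw JSON line, whatever its shape."""
    return [json.dumps(record) for record in records]


def read_with_schema(raw_lines: list[str], fields: list[str]) -> list[dict]:
    """Schema-on-read: project only the fields that a given analysis needs."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in fields})
    return rows


if __name__ == "__main__":
    records = [
        {"order_id": 1, "amount": 9.5},
        {"order_id": 2, "amount": 20.0, "coupon": "SPRING"},
        {"event": "page_view", "url": "/home"},
    ]
    print(load_schema_on_write(records))  # only the first record fits the schema
    raw = load_raw(records)               # all three records are kept
    print(read_with_schema(raw, ["order_id", "coupon"]))
```

If a new requirement later asks about coupons, the lake-style store can answer it by reading with a different set of fields, while the warehouse-style path never stored that information in the first place.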

6. Data Lake Implementations

Data lake implementations can be deployed stage by stage, in an agile fashion.

Stage 1

Set up a basic data lake that accepts raw data from all identified sources.

Stage 2

Apply data science tools and methodologies to come up with analytical data models.

Stage 3

Run the data lake in parallel with the enterprise data warehouse, storing low-priority data as well.

Stage 4

Take over from the data warehouse and make the data lake the single source of all enterprise data.

7. What are the benefits of Data Lakes?

The most valuable advantage of data lakes is the flexibility they offer in adjusting to changing business requirements and changing data ecosystems. Other major benefits are listed here.

  • Scalability

With ever-increasing amounts of data, it is important that your business has the right setup to handle the avalanche. Because data lakes sit on economical commodity storage, they can keep growing as data volumes reach enterprise level and beyond.

  • Capture streaming data and react in near real-time

Tools that are part of big data infrastructures, like Kafka and Flume, can acquire data arriving at great speed and volume. Data lakes can use these tools to pool all that data for further analysis.

  • Save on precious deployment time

Exploratory data analysis and modelling take place when and where they are needed, unlike in the DWH architecture, where they have to be done before data is even ingested.

  • Schema-less data ingestion

Data lakes allow data to be stored in a schema-less way, which is useful at the time of ingestion, particularly for high-speed feeds.

  • Advanced Analytics 

The large volumes of data are ideal for Machine Learning based analytical tools, which analyse large datasets to produce accurate statistical models that are then used in predictive analytics.

  • Elimination of data silos

Data silos are a major pain point in the DWH world, but in a data lake everyone accesses the same data through one unified view, available to all users with the right credentials.
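The real-time capture and schema-less ingestion benefits listed above can be sketched together in a few lines of Python. This is only an illustration: a plain Python iterable stands in for a stream that a tool such as Kafka or Flume would deliver, and the micro_batches and land_stream functions, the raw/stream folder and the batch file names are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group an incoming event stream into small fixed-size batches."""
    batch: list[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield batch


def land_stream(events: Iterable[dict], lake_root: Path, batch_size: int = 3) -> list[Path]:
    """Write each micro-batch to the raw zone as JSON lines, with no schema enforced."""
    landing = lake_root / "raw" / "stream"
    landing.mkdir(parents=True, exist_ok=True)
    written = []
    for index, batch in enumerate(micro_batches(events, batch_size)):
        target = landing / f"batch-{index:05d}.jsonl"
        with target.open("w", encoding="utf-8") as handle:
            for event in batch:
                handle.write(json.dumps(event) + "\n")
        written.append(target)
    return written


if __name__ == "__main__":
    # Events of different shapes arrive on the same stream.
    stream = [
        {"sensor": "t1", "temp": 21.4},
        {"user": "ann", "action": "click"},
        {"sensor": "t2", "temp": 19.9, "unit": "C"},
        {"log": "disk almost full"},
    ]
    for path in land_stream(stream, Path(tempfile.mkdtemp())):
        print(path.name, len(path.read_text(encoding="utf-8").splitlines()), "events")
```

Writing small batches keeps the delay between an event arriving and it being available in the lake short, and because nothing is validated against a schema, events of any shape land in the same place for later analysis.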

Conclusion

Data lakes are the way to go for large enterprises that intend to capture every bit of data flowing through their business, make sense of it, extract value and stay ahead of competitors.

