Fundamentals — Cloud Data Lake Architecture

Sanjeeb Mohapatra
Apr 26, 2024

An Enterprise Data Warehouse (EDW) is very good at storing highly structured data generated by transactional systems. Today, however, much of the data is generated by weblogs and mobile devices and is unstructured or semi-structured in nature. Storing, processing, and extracting insights from both relational and non-relational data calls for another kind of system. This need led to the evolution of the cloud data lake.

A cloud data lake is a central repository that is highly scalable, secure, and fault tolerant, where organizations can store and manage petabytes of data of different kinds, such as:

1. Highly structured data (relational data in row-column based tables).

2. Semi-structured data (JSON, XML, sensor data, etc.).

3. Unstructured data (audio, video, PDF documents, emails, etc.).

Data from different sources, whether structured or unstructured, can easily be ingested into the data lake and stored in its original format. It can then be transformed or enriched into the desired format for business use.
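The ingest-then-transform flow described above can be sketched in a few lines of Python. This is only a minimal illustration, not an AWS implementation: a local folder stands in for cloud storage, and the zone names `raw` and `curated`, the function names, and the newline-delimited JSON input are all assumptions made for the example.

```python
import csv
import json
import shutil
import tempfile
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str) -> Path:
    """Copy a source file into the raw zone byte for byte, with no parsing."""
    target_dir = lake_root / "raw" / source_name  # hypothetical zone layout
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copyfile(source_file, target)
    return target


def transform_to_curated(raw_file: Path, lake_root: Path, source_name: str) -> Path:
    """Read newline-delimited JSON from the raw zone and write a flat CSV."""
    records = [
        json.loads(line)
        for line in raw_file.read_text().splitlines()
        if line.strip()
    ]
    # Use the union of keys so records with missing fields still fit the table.
    columns = sorted({key for record in records for key in record})
    target_dir = lake_root / "curated" / source_name  # hypothetical zone layout
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (raw_file.stem + ".csv")
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty cells
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "events.json"
        source.write_text('{"user": "a", "clicks": 3}\n{"user": "b"}\n')
        lake = Path(tmp) / "lake"
        raw_copy = ingest_raw(source, lake, "weblog")
        print(transform_to_curated(raw_copy, lake, "weblog").read_text())
```

The point of keeping the two steps separate is that the raw copy is never modified, so the same source data can be transformed again later into a different format if business needs change.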

A cloud data lake takes advantage of different data services for ingesting, processing, and transforming data from source to target. It mostly follows a decoupled architecture, in which data is stored in cloud storage and processing happens using SQL or cloud-based tools such as AWS Glue or Apache Spark. For consumption, business intelligence tools such as Tableau or Power BI can be integrated with the cloud data lake for visualization.
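To make the decoupled idea concrete, here is a small sketch in which storage and compute are separate pieces. It is an illustration under stated assumptions, not how AWS Glue or Spark work internally: a folder of CSV files stands in for cloud storage, an in-memory SQLite database stands in for the SQL engine, and the function name `run_query` is made up for this example. All files are assumed to share the same header row.

```python
import csv
import sqlite3
from pathlib import Path


def run_query(storage_dir: Path, table: str, sql: str) -> list:
    """Start a throwaway SQL engine, load files from storage, run one query."""
    engine = sqlite3.connect(":memory:")  # compute: exists only for this call
    try:
        table_created = False
        for data_file in sorted(storage_dir.glob("*.csv")):
            with data_file.open(newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                if not table_created:
                    columns = ", ".join(f'"{name}"' for name in header)
                    engine.execute(f'CREATE TABLE "{table}" ({columns})')
                    table_created = True
                placeholders = ", ".join("?" for _ in header)
                engine.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', reader
                )
        return engine.execute(sql).fetchall()
    finally:
        engine.close()  # compute goes away; the files in storage are untouched
```

Calling `run_query(storage_dir, "events", "SELECT COUNT(*) FROM events")` twice creates two independent engines over the same files. That is the property the decoupled architecture relies on: storage persists and can grow on its own, while compute can be started, scaled, or replaced without moving the data.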

In this blog we will look at a logical architecture for a cloud data lake on AWS. There is no demo in this blog; however, in future blogs we will build small projects using this architecture.

Description

The details above give a very high-level view of the flow in a cloud data lake architecture. To become an expert in this field, a detailed deep dive into each AWS service is required.

Note: This blog is influenced by the book “Data Engineering with AWS” by Gareth Eagar.
