Data Engineering: Understanding Snowflake architecture from a beginner's perspective!
Snowflake is one of the most popular cloud data warehouse solutions at present. It makes data storage, data processing and advanced analytics easier to implement and more flexible than traditional and other cloud data warehouses.
The following are key features of the Snowflake data warehouse:
1. It is a self-service, fully managed service, which means there is NO hardware to select, manage or patch.
2. All administration work such as upgrades, maintenance and tuning is handled by Snowflake.
3. End users can concentrate on application development, deploy code to production quickly and get value sooner.
4. Snowflake runs completely on cloud infrastructure. At present it supports AWS, GCP and Azure.
5. Snowflake uses virtual compute instances for its compute needs and a storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted). More details can be found at https://docs.snowflake.com/en/user-guide/intro-key-concepts
In this blog, we will explore the Snowflake architecture at a beginner level.
Snowflake architecture is a hybrid of two models: the traditional shared-disk model, where storage is shared by all compute nodes, and the shared-nothing model, where each compute node works independently. On the shared-disk side, Snowflake uses a central repository to store the data (S3 in AWS, Azure Blob Storage in Azure, Google Cloud Storage in GCP). This persisted storage is accessible from all shared-nothing compute nodes (the processing layer) in the Snowflake platform. These compute clusters are called virtual warehouses in the Snowflake world. Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach combines the data management simplicity of a shared-disk architecture with the performance and scale-out benefits of a shared-nothing architecture.
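To make the hybrid idea concrete, here is a minimal sketch in Python. It is only a toy model, not how Snowflake is implemented: the `SHARED_STORAGE` dictionary, the partition names and the `run_sum_query` function are all made up for illustration. One central store is read by a cluster of any size, and each node scans only its own share of the partitions before the partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Central storage shared by every cluster (stands in for S3 / Blob Storage / GCS).
# Partition names and values are hypothetical.
SHARED_STORAGE = {
    "part-0": [10, 20, 30],
    "part-1": [5, 15],
    "part-2": [40],
    "part-3": [1, 2, 3, 4],
}


def scan_partitions(partition_ids):
    """Work done by one compute node: sum only the partitions assigned to it."""
    return sum(sum(SHARED_STORAGE[pid]) for pid in partition_ids)


def run_sum_query(node_count):
    """Split the partitions across `node_count` nodes and merge the partial sums."""
    partition_ids = sorted(SHARED_STORAGE)
    # Round-robin assignment: node i gets partitions i, i + n, i + 2n, ...
    assignments = [partition_ids[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_sums = list(pool.map(scan_partitions, assignments))
    return sum(partial_sums)


small_cluster_total = run_sum_query(node_count=1)
large_cluster_total = run_sum_query(node_count=4)
print(small_cluster_total, large_cluster_total)  # 130 130
```

Both cluster sizes return the same answer because they read the same shared data. Only the amount of work done per node changes, which is the scale-out benefit described above.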
The Snowflake architecture consists of 3 layers, i.e. the storage layer, the query processing layer and the cloud services layer. The diagram below depicts the Snowflake architecture; it is taken from the official Snowflake documentation page.
Let us discuss each layer in detail.
Layer-1: Database Storage Layer:
When source system data is ingested into Snowflake, Snowflake reorganizes it into its internal optimized, compressed, columnar format. The data is stored in the cloud storage of the cloud platform that the user chooses when opening the Snowflake account. The storage options are AWS S3, Azure Blob Storage and Google Cloud Storage. Snowflake manages all the details of the stored files, such as metadata (data about data) and statistics (file size, compression). The files stored by Snowflake are not directly visible to users; however, the data is accessible through native SQL using the Snowflake interface or client tools.
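The idea of a columnar format with per-file metadata can be sketched in a few lines of Python. This is an illustration only: Snowflake's real file format is internal and not visible to users, so the sample rows, the function names `to_columnar` and `build_file_metadata`, and the use of `zlib` for compression are all assumptions chosen to keep the example small.

```python
import json
import zlib

# Hypothetical row-oriented records as they might arrive from a source system.
rows = [
    {"order_id": 1, "country": "IN", "amount": 250},
    {"order_id": 2, "country": "IN", "amount": 100},
    {"order_id": 3, "country": "US", "amount": 400},
    {"order_id": 4, "country": "US", "amount": 150},
]


def to_columnar(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def build_file_metadata(columns):
    """Collect per-column statistics and the compressed size of each column."""
    metadata = {}
    for name, values in columns.items():
        encoded = json.dumps(values).encode("utf-8")
        metadata[name] = {
            "min": min(values),
            "max": max(values),
            "row_count": len(values),
            "compressed_bytes": len(zlib.compress(encoded)),
        }
    return metadata


columns = to_columnar(rows)
metadata = build_file_metadata(columns)
print(columns["amount"])                                    # [250, 100, 400, 150]
print(metadata["amount"]["min"], metadata["amount"]["max"])  # 100 400
```

Keeping the values of one column together is what makes compression and statistics such as minimum and maximum cheap to compute, and a query that needs only `amount` never has to read `country`.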
Layer-2: Query Processing Layer:
All end user queries are processed in the query processing layer. Snowflake processes queries using virtual warehouses. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider. This is the muscle of the Snowflake platform. Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses. This is the shared-nothing part of the architecture.
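The isolation between virtual warehouses can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not Snowflake's actual scheduler: the `VirtualWarehouse` class, the fixed number of `slots` and the warehouse names `ETL_WH` and `BI_WH` are hypothetical. The point is only that each warehouse has its own resources and its own queue.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualWarehouse:
    """Toy warehouse: a fixed number of query slots plus its own queue."""
    name: str
    slots: int
    running: list = field(default_factory=list)
    queued: list = field(default_factory=list)

    def submit(self, query):
        # A query runs immediately if a slot is free; otherwise it waits
        # in this warehouse's queue only.
        if len(self.running) < self.slots:
            self.running.append(query)
        else:
            self.queued.append(query)


etl_wh = VirtualWarehouse(name="ETL_WH", slots=2)
bi_wh = VirtualWarehouse(name="BI_WH", slots=2)

# Overload the ETL warehouse with five load jobs.
for i in range(5):
    etl_wh.submit(f"load_batch_{i}")

# The BI warehouse is unaffected and runs its query straight away.
bi_wh.submit("dashboard_query")

print(len(etl_wh.running), len(etl_wh.queued))  # 2 3
print(len(bi_wh.running), len(bi_wh.queued))    # 1 0
```

In this sketch the heavy load builds a queue on `ETL_WH` only, while the dashboard query on `BI_WH` starts immediately, which mirrors the "no impact on other virtual warehouses" behavior described above.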
Layer-3: Cloud Services Layer:
The cloud services layer is the brain of the Snowflake platform. It is a collection of services that coordinate with each other to process user requests, from login to query dispatch to the query processing layer for execution. This layer also runs on compute instances provisioned from the cloud provider. Services managed in this layer include:
· Authentication
· Infrastructure management
· Metadata management
· Query parsing and optimization
· Access control
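The path of a request through these services, from login to query dispatch, can be sketched as a small Python pipeline. Everything here is a hypothetical stand-in rather than Snowflake's API: the `USERS`, `GRANTS` and `TABLES` dictionaries, the tiny `parse` function that understands only one query shape, and `execute_on_warehouse`, which plays the role of the query processing layer.

```python
USERS = {"analyst": "s3cret"}               # hypothetical credentials
GRANTS = {"analyst": {"SALES.ORDERS"}}      # hypothetical table grants
TABLES = {"SALES.ORDERS": [120, 80, 300]}   # stands in for stored data


def authenticate(user, password):
    """Authentication service: reject unknown users or wrong passwords."""
    if USERS.get(user) != password:
        raise PermissionError("authentication failed")


def parse(sql):
    """Very small parser: only understands 'SELECT SUM FROM <table>'."""
    tokens = sql.strip().rstrip(";").split()
    if len(tokens) != 4 or [t.upper() for t in tokens[:3]] != ["SELECT", "SUM", "FROM"]:
        raise ValueError("unsupported query")
    return {"operation": "sum", "table": tokens[3].upper()}


def authorize(user, plan):
    """Access control service: check the user's grant on the target table."""
    if plan["table"] not in GRANTS.get(user, set()):
        raise PermissionError(f"no access to {plan['table']}")


def execute_on_warehouse(plan):
    """Stands in for the query processing layer."""
    return sum(TABLES[plan["table"]])


def handle_request(user, password, sql):
    """Cloud services flow: login, parse, access check, then dispatch."""
    authenticate(user, password)
    plan = parse(sql)
    authorize(user, plan)
    return execute_on_warehouse(plan)


result = handle_request("analyst", "s3cret", "select sum from sales.orders")
print(result)  # 500
```

A request with a wrong password or without a grant on the table stops with a `PermissionError` before any compute is used, which is why these checks sit in the cloud services layer in front of the virtual warehouses.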