Brief Summary
Alright, so this masterclass is all about becoming an analytics engineer, a role that's super in demand and kinda AI-proof. You'll get a solid foundation, learn the data lifecycle, and get hands-on with tools like DBT and Databricks. It's a sweet spot between data analyst and data engineer, so buckle up and get ready to upskill!
- Analytics engineering is a growing field and is more AI-proof than adjacent data roles.
- Master the data lifecycle, ELTL, data modeling, DBT, and more.
- This role is a sweet spot between a data analyst and a data engineer.
Introduction
The analytics engineering domain is booming, and this masterclass is designed to help you get started, even if you're a beginner. By the end, you'll have a strong foundation, understand key concepts, and know which tools to use. Resources will be shared, so get ready to take notes and dive in!
Agenda
The agenda covers:
- Understanding the analytics engineer role
- The complete data lifecycle
- ELTL
- Blob storage versus data lakes
- Lakehouses and open table formats
- Incremental loading, backfilling, and upserts
- Apache Spark and PySpark
- Relational and dimensional data modeling
- Apache Airflow and Azure Data Factory for orchestration
- DBT
- Slowly changing dimensions with a Delta Lake demo
What is an Analytics Engineer Role?
An analytics engineer sits between a data engineer and a data analyst. They know almost everything a data analyst knows, plus some of what a data engineer does, which makes the role senior to the analyst but junior to the engineer. While data analysts use tools like Power BI and SQL, and data engineers handle pipelines and data warehouses, the analytics engineer works with data models, schemas, and lakehouses. This role is less likely to be automated by AI because it involves infrastructure-level knowledge and debugging, unlike report building, which is much easier to automate.
Complete Data Lifecycle
The data lifecycle starts with data generation from sources like mobile apps and websites, which is stored in a transactional SQL database, typically managed by a DBA. Next, data is extracted and loaded into a data lake. Transformation then happens using frameworks like Apache Spark and DBT, followed by data modeling to create fact and dimension tables. Data is stored at every stage, and finally reporting is done with tools like Power BI and Tableau. The analytics engineer sits in the middle of this lifecycle, focusing on transformation, data modeling, and storage.
ETL & ELT
Earlier, ETL was the standard approach: data was extracted, transformed, and then loaded into a data warehouse. As data volumes grew, ELT became popular instead: data is extracted, loaded into a data lake first, and only then transformed; in ELTL, a final load moves the transformed data onward, for example into a warehouse or serving layer. This modern approach takes advantage of the data lake's cheap storage and lets the raw data be reused. ELTL also introduces staging layers, which can be transient (temporary) or persistent (long-term). The move to ELT was driven by data size and the compute needed to transform it.
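To make the ELTL flow concrete, here is a minimal sketch in Python with Pandas; the connection string, table name, and folder layout are illustrative assumptions, not part of the masterclass.

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

lake = Path("datalake")
(lake / "raw").mkdir(parents=True, exist_ok=True)
(lake / "curated").mkdir(parents=True, exist_ok=True)

# Extract: pull the source table as-is from the transactional database.
engine = create_engine("postgresql://user:password@host:5432/shop")
orders = pd.read_sql("SELECT * FROM orders", engine)

# Load: land the untouched data in the raw (staging) layer of the lake.
orders.to_parquet(lake / "raw" / "orders.parquet", index=False)

# Transform: clean and aggregate from the raw layer.
raw = pd.read_parquet(lake / "raw" / "orders.parquet")
daily_revenue = (
    raw.dropna(subset=["order_date"])
       .groupby("order_date", as_index=False)["amount"].sum()
)

# Load again (the final L in ELTL): persist the curated output for reporting.
daily_revenue.to_parquet(lake / "curated" / "daily_revenue.parquet", index=False)
```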
Data Lake
Data lakes are built on blob storage, where files are broken into chunks and distributed across disks. To turn blob storage into a data lake, hierarchical namespaces are enabled so that data can be organized into dedicated folders. Data lakes offer cost-effective storage, but because they lack a database's guarantees, data quality, and with it the quality of reports, can degrade.
Data Lakehouse (Modern Data Warehouse)
Data lakehouses combine the cost benefits of data lakes with the quality of data warehouses. They use open table formats like Delta Lake and Iceberg. Data is stored in Parquet files, and a transaction log manages all transactions, enabling ACID properties. This allows SQL queries to be run on data lake data, providing the same results as a SQL database.
Open Table Format (Delta Lake)
Open table formats provide an abstraction layer that allows SQL queries to run on file formats, enabling ACID transactional properties. Delta Lake is a popular open table format, widely adopted and mature.
Delta Lake With Python
Delta Lake can be used with Python and Pandas. To get started, install the Delta Lake package and use Pandas to create a data frame. Then, use Delta Lake's writer to write the data frame to a Delta format. This generates a Parquet file and a Delta log. You can also read Delta Lake data back into a Pandas data frame.
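A minimal sketch of that flow, assuming the `deltalake` Python package (the delta-rs bindings) and an illustrative local path:

```python
# pip install deltalake pandas
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Create a small DataFrame to persist.
df = pd.DataFrame({"id": [1, 2, 3], "city": ["Berlin", "Pune", "Austin"]})

# Write it in Delta format: this creates Parquet data files plus a _delta_log folder.
write_deltalake("tmp/cities_delta", df, mode="overwrite")

# Read the Delta table back into a Pandas DataFrame.
round_trip = DeltaTable("tmp/cities_delta").to_pandas()
print(round_trip)
```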
Incremental Loading
Incremental loading involves loading only new data to avoid reloading the entire dataset. A modern approach uses a JSON configuration file to store the last load date. A query is run to fetch data where the updated date is greater than or equal to the last load date. After loading, the JSON file is updated with the latest date. Backfilling involves reloading data for a specific period. This is done by creating a dynamic SQL query with a parameter to change the last load date.
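A sketch of that pattern in Python; the config file, table, and column names are assumptions for illustration, and the override parameter stands in for the backfill mechanism described above.

```python
import json
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

CONFIG = Path("last_load.json")  # e.g. {"last_load_date": "2024-01-01T00:00:00"}
engine = create_engine("postgresql://user:password@host:5432/shop")

def incremental_load(override_date=None):
    """Load only rows updated since the last run; pass override_date to backfill."""
    last_load = override_date or json.loads(CONFIG.read_text())["last_load_date"]

    # Parameterised query: fetch only rows changed on or after the watermark.
    query = text("SELECT * FROM orders WHERE updated_at >= :last_load")
    new_rows = pd.read_sql(query, engine, params={"last_load": last_load})

    # After a normal (non-backfill) run, advance the watermark in the JSON file.
    if override_date is None and not new_rows.empty:
        latest = new_rows["updated_at"].max()
        CONFIG.write_text(json.dumps({"last_load_date": str(latest)}))
    return new_rows

# Regular incremental run:
# incremental_load()
# Backfill everything since a chosen date without moving the watermark:
# incremental_load(override_date="2023-12-01T00:00:00")
```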
Apache Spark & PySpark Overview
Apache Spark is a distributed computing framework, while PySpark is a Python library used to work with Apache Spark. Apache Spark uses a cluster of nodes to process data in parallel, offering horizontal scaling. PySpark allows Python developers to leverage Apache Spark's capabilities.
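A tiny PySpark sketch, assuming a local Spark installation; the file path and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session; locally this runs on all available cores,
# and on a cluster the same code is distributed across worker nodes.
spark = SparkSession.builder.appName("pyspark-overview").getOrCreate()

# Read a CSV into a distributed DataFrame.
orders = spark.read.csv("datalake/raw/orders.csv", header=True, inferSchema=True)

# Transformations are lazy; Spark builds a plan and executes it in parallel on show().
daily_revenue = (
    orders.groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"))
          .orderBy("order_date")
)
daily_revenue.show()

spark.stop()
```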
Relational Data Modeling
Relational database (RDB) modeling involves creating tables with primary and foreign keys to establish relationships. It uses normalization to reduce redundancy. However, in data engineering, dimensional data modeling is preferred.
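A small sketch of those ideas using Python's built-in sqlite3; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs

conn.executescript("""
-- Normalization: customer details live in one place only.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);

-- Each order references its customer through a foreign key.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT,
    amount      REAL
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'asha@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2024-01-05', 49.90)")

# Joining through the key relationship reconstructs the full picture.
for row in conn.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
"""):
    print(row)
```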
Dimensional Data Modeling
Dimensional data modeling starts from one big table (OBT) and breaks it down into facts and dimensions. Dimensions store context, while facts store numbers, and surrogate keys are used to join them. This model is optimized for reporting and for querying large datasets.
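A Pandas sketch of that split, with made-up data, showing a dimension keyed by a surrogate key and a fact table that references it:

```python
import pandas as pd

# One big table (OBT): every row repeats the customer context next to the numbers.
obt = pd.DataFrame({
    "order_id":   [10, 11, 12],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
    "customer":   ["Asha", "Ben", "Asha"],
    "country":    ["IN", "US", "IN"],
    "amount":     [49.90, 15.00, 20.50],
})

# Dimension: distinct context rows, each assigned a surrogate key.
dim_customer = obt[["customer", "country"]].drop_duplicates().reset_index(drop=True)
dim_customer["customer_key"] = dim_customer.index + 1

# Fact: the measures, referencing the dimension via the surrogate key.
fact_orders = obt.merge(dim_customer, on=["customer", "country"])[
    ["order_id", "order_date", "customer_key", "amount"]
]

print(dim_customer)
print(fact_orders)
```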
Data Orchestration (Airflow & ADF)
Data orchestration involves building pipelines to automate data workflows. It requires expertise in monitoring, handling failures, and managing conditions. Options for data orchestration include Azure Data Factory and Fabric Data Factory, which are low-code ETL tools, and Apache Airflow, which is a code-first approach.
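As a flavour of the code-first approach, here is a minimal Airflow sketch (assuming Airflow 2.x; the DAG name, tasks, and schedule are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract new rows from the source system")

def transform():
    print("clean and model the extracted data")

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",       # run once per day
    catchup=False,           # don't backfill past runs automatically
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Orchestration: transform only runs after extract succeeds.
    extract_task >> transform_task
```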
Data Build Tool (DBT)
DBT (Data Build Tool) is a framework that uses a SQL-first approach for data transformation and modeling. It requires knowledge of SQL, YAML, and Jinja templating. DBT solves the problem of writing complex code for tasks like incremental loading and slowly changing dimensions. It also acts as an adapter, allowing you to connect to different data warehouses and lakehouses.
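DBT models are usually SQL with Jinja, but on warehouses such as Databricks it also supports Python models; a small sketch of one (the model and upstream names are assumptions):

```python
# models/orders_cleaned.py -- a DBT Python model (SQL + Jinja models are the default)
def model(dbt, session):
    # Materialize the result as a table in the connected warehouse or lakehouse.
    dbt.config(materialized="table")

    # Reference an upstream DBT model; DBT resolves the dependency graph.
    orders = dbt.ref("stg_orders")

    # On Databricks this is a Spark DataFrame, so normal DataFrame code applies.
    return orders.filter(orders.amount > 0)
```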
DBT Account Setup
To set up a DBT account, create a free account on DBT Cloud. Then, create a new project and connect it to a data warehouse like Databricks. You'll need to provide a token to allow DBT to access your Databricks account.
DBT With Databricks Tutorial
In Databricks, create a catalog and schema. Upload a CSV file to create a managed table. In DBT, create a new project and connect it to Databricks. Create a new model with a SQL query to transform the data. Run the model to create a table in Databricks.
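The Databricks side of that setup can also be scripted; a sketch assuming a Unity Catalog-enabled notebook where `spark` is predefined, with illustrative catalog, schema, and path names:

```python
# Inside a Databricks notebook (Unity Catalog enabled); names are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS masterclass")
spark.sql("CREATE SCHEMA IF NOT EXISTS masterclass.sales")

# Turn an uploaded CSV into a managed table that DBT models can then build on.
raw = spark.read.csv(
    "/Volumes/masterclass/sales/uploads/orders.csv",
    header=True,
    inferSchema=True,
)
raw.write.mode("overwrite").saveAsTable("masterclass.sales.orders_raw")
```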
Slowly Changing Dimensions
Slowly changing dimensions (SCDs) handle changes in dimension tables. Type 1 SCDs overwrite existing data, which is essentially an upsert operation. Type 2 SCDs keep record history by adding new rows with start and end dates and a flag marking the current value. Type 3 SCDs store only the previous value of a change in an extra column.
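A Type 1 sketch using the Delta Lake merge API on Spark; the table path and column names are assumptions, and it presumes a Delta-enabled session such as a Databricks notebook where `spark` already exists.

```python
from delta.tables import DeltaTable

# Incoming changes for the customer dimension (the source of the upsert).
updates = spark.createDataFrame(
    [(1, "Asha", "Pune"), (4, "Dana", "Lisbon")],
    ["customer_id", "name", "city"],
)

# Existing Delta dimension table; the path is illustrative.
dim = DeltaTable.forPath(spark, "datalake/curated/dim_customer")

# SCD Type 1 upsert: overwrite matching rows, insert brand-new ones.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
# For SCD Type 2, the merge would instead expire the matched row (set its end date
# and current flag) and insert a new row with a fresh start date.
```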

