Lesson 2: The Modern Data Stack
Course: Data Engineering | Duration: 40 minutes | Level: Intermediate
Learning Objectives
By the end of this lesson, you will be able to:
- Describe the four layers of the modern data stack and what each one does
- Name at least two tools in each layer
- Explain the tradeoffs between open-source and managed service options
- Trace the journey of a piece of data from its source to a business dashboard
Prerequisites
- Lesson 1: What is Data Engineering?
Part 1: What is a "Stack"?
In software engineering, a "stack" refers to the set of technologies used together to accomplish a goal. The classic web development LAMP stack (Linux, Apache, MySQL, PHP) is one example — four technologies that work together to serve web pages.
The "modern data stack" is the set of technologies used together to collect, store, process, and serve data in a business context. Like all stacks, it is layered: each layer receives data from the layer below, does something useful with it, and passes it up to the layer above.
The term "modern" distinguishes today's architecture from the data warehousing practices of the 1990s and 2000s. The old model relied on monolithic on-premises systems (Oracle, Teradata) that combined storage, compute, and transformation. The modern model separates these concerns, uses cloud infrastructure, and favors composable tools that each do one thing well.
Understanding the stack conceptually is critical before writing any code. When you build a pipeline in Section 3, you will be building an ingestion layer. When you write transformations in Section 4, you will be working in the transformation layer. Knowing where your work fits helps you make good architectural decisions.
Part 2: The Four Layers
Layer 1: Data Sources
Data sources are where data originates. They are not part of the data stack — they exist independently — but they are the starting point for everything.
Common source types include:
Operational databases are the production databases that applications write to: PostgreSQL for the web app's user records, MySQL for the e-commerce order system, MongoDB for the content management system. These databases are optimized for transactional writes, not analytical reads.
APIs expose data from external services: Stripe for payment data, Salesforce for CRM data, Google Analytics for web traffic, HubSpot for marketing data. APIs return data on request in JSON or XML format.
Files and object storage include CSV exports from legacy systems, log files generated by application servers, flat files dropped by data vendors, and Parquet files produced by other pipelines. These land in cloud object storage (AWS S3, Google Cloud Storage, Azure Blob Storage).
Event streams are the real-time flow of events generated by user actions in applications: page views, button clicks, form submissions, purchase completions. These are captured by event tracking systems (Segment, Snowplow) or sent directly to message queues (Kafka).
The data stack cannot control what sources produce. It can only adapt to them.
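To make the API source type concrete, here is a minimal sketch of handling the kind of JSON payload a payments API might return. The payload shape and field names below are illustrative, not any real provider's schema; a real script would fetch the response over HTTP rather than hard-code it.

```python
import json

# A hypothetical payload, shaped like what a payments API might return.
# The field names are illustrative, not Stripe's actual schema.
raw_response = """
{
  "data": [
    {"id": "pay_001", "amount": 2500, "currency": "usd"},
    {"id": "pay_002", "amount": 1200, "currency": "usd"}
  ],
  "has_more": false
}
"""

payload = json.loads(raw_response)
records = payload["data"]

for r in records:
    # Payment APIs often report amounts in minor units (cents).
    print(r["id"], r["amount"] / 100, r["currency"])
```

Notice that the ingestion code has to adapt to whatever shape the source returns, including conventions like minor-unit amounts and pagination flags such as has_more.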
Layer 2: Ingestion and Orchestration
The ingestion layer is responsible for moving data from sources into the storage layer reliably and on schedule.
Ingestion tools connect to sources, extract data, and load it into a destination. The two dominant categories are:
Managed connectors (Fivetran, Airbyte, Stitch) provide pre-built connectors to hundreds of data sources. You configure the source, the destination, and the sync schedule — and the tool handles the rest, including schema detection, incremental loading, and error retry. These services handle the most tedious part of data engineering so you can focus on transformation.
Custom Python scripts are written when a managed connector does not exist for a source, when the connector's behavior does not meet requirements, or when the cost of a managed service is not justified. This is where the Python skills you are building in this course matter most.
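Here is a minimal sketch of what such a custom script looks like: an extract step (stubbed as a list of dicts standing in for a source system) and a load step into SQLite, which stands in for a warehouse. Table and field names are illustrative.

```python
import sqlite3

def extract():
    # Stand-in for reading from a source API or database;
    # a real script would call the source system here.
    return [
        {"order_id": 1, "customer_id": 10, "amount": 49.99},
        {"order_id": 2, "customer_id": 11, "amount": 19.50},
    ]

def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    # INSERT OR REPLACE keyed on order_id makes re-runs idempotent:
    # loading the same batch twice does not duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_orders "
        "VALUES (:order_id, :customer_id, :amount)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(extract(), conn)
row_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(f"Loaded {row_count} rows")
```

The idempotent load is the important habit: ingestion jobs get retried, and a rerun should never double-count data.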
Orchestration tools schedule, run, and monitor pipeline workflows. The most widely used is Apache Airflow, which lets you define pipelines as Python code (DAGs — Directed Acyclic Graphs) and provides a web UI for monitoring. Prefect and Dagster are modern alternatives that address some of Airflow's operational complexity.
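The core abstraction behind Airflow — the DAG — can be illustrated with the standard library alone. The sketch below uses graphlib to compute a valid execution order for tasks with dependencies; the task names are hypothetical, and a real Airflow DAG would use its own operators and scheduler rather than this.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_events": set(),
    "load_warehouse": {"extract_orders", "extract_events"},
    "run_dbt": {"load_warehouse"},
    "quality_checks": {"run_dbt"},
}

# A topological sort yields an order in which every task runs
# only after all of its dependencies have run.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two extract tasks have no dependency between them, so an orchestrator is free to run them in parallel — exactly the scheduling decision Airflow makes from the same graph structure.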
Layer 3: Storage
The storage layer is where ingested data lives. There are two primary storage paradigms in the modern stack.
Data warehouses (Snowflake, Google BigQuery, Amazon Redshift) are analytical databases optimized for reading large amounts of data quickly. They use columnar storage, which is efficient for aggregation queries that scan millions of rows but read only a few columns. Data warehouses enforce schemas, support SQL, and integrate directly with BI tools. They are the destination for transformed, structured data.
Data lakes (AWS S3 + Athena, Google Cloud Storage + BigQuery external tables, Azure Data Lake) are object storage systems that hold raw data in any format — JSON, CSV, Parquet, images, log files. There is no schema enforcement at write time; schema is applied at read time (schema-on-read). Data lakes make it cheap to store large volumes of data and are the natural landing zone for raw ingested data before it is transformed.
In practice, most modern data stacks use both: a data lake to store raw data cheaply (the "bronze" layer), and a data warehouse to serve structured, transformed data to analysts (the "gold" layer).
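The row-vs-columnar distinction described above can be sketched in plain Python. Both layouts below hold the same three orders; summing the amount field touches every record in the row layout but exactly one list in the columnar layout, which is why warehouses favor columns for aggregations. This is an illustration of the idea, not how a real warehouse physically stores data.

```python
# Row-oriented: each record stored together, as an OLTP database would.
rows = [
    {"order_id": 1, "category": "books", "amount": 12.0},
    {"order_id": 2, "category": "games", "amount": 30.0},
    {"order_id": 3, "category": "books", "amount": 8.0},
]

# Column-oriented: each field stored as its own contiguous array.
columns = {
    "order_id": [1, 2, 3],
    "category": ["books", "games", "books"],
    "amount": [12.0, 30.0, 8.0],
}

# Aggregating the row layout scans whole records...
total_rows = sum(r["amount"] for r in rows)

# ...while the columnar layout reads exactly one array.
total_cols = sum(columns["amount"])

print(total_rows, total_cols)
```

At millions of rows, skipping the columns a query never reads is the difference between scanning gigabytes and scanning megabytes.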
Layer 4: Transformation and Serving
The transformation layer is where raw, ingested data is cleaned, joined, aggregated, and shaped into the data models that analysts and applications actually use.
dbt (data build tool) has become the dominant transformation tool in the modern stack. It lets data engineers and analytics engineers write SQL SELECT statements that define how to transform raw data, and it handles running those transformations, testing the outputs, and generating documentation. dbt operates inside the data warehouse — transformations are expressed as SQL and executed by the warehouse's compute engine.
Python-based transformations using pandas or Spark are used when transformations are too complex for SQL, when performance requires distributed processing, or when data needs processing before it reaches the warehouse.
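To show the shape of a warehouse-side transformation, the sketch below runs a dbt-style SELECT against SQLite (standing in for the warehouse): it joins raw orders to customers and aggregates revenue by region. The table and column names are illustrative; a real dbt model would be the SELECT statement alone, with dbt handling the materialization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE raw_customers (customer_id INTEGER, region TEXT);
    INSERT INTO raw_orders VALUES (1, 10, 40.0), (2, 11, 25.0), (3, 10, 15.0);
    INSERT INTO raw_customers VALUES (10, 'EU'), (11, 'US');
""")

# A dbt model is essentially a named SELECT like this one;
# dbt materializes it as a table or view inside the warehouse.
conn.execute("""
    CREATE TABLE fct_revenue_by_region AS
    SELECT c.region, SUM(o.amount) AS revenue
    FROM raw_orders o
    JOIN raw_customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

results = conn.execute(
    "SELECT region, revenue FROM fct_revenue_by_region ORDER BY region"
).fetchall()
print(results)
```

The transformation logic lives entirely in SQL and executes on the database's own compute — the pattern that makes dbt a thin, testable layer rather than a processing engine.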
Serving is the final layer: the transformed data is consumed by BI tools (Looker, Tableau, Metabase, Mode) for dashboards, by APIs for application features, or by machine learning platforms for model training.
Part 3: Open-Source vs Managed Services
Every layer of the modern data stack can be filled by either open-source software you operate yourself or managed services you pay for.
Open-source tools give you full control and no licensing cost, but you pay in operational complexity. Running Apache Airflow requires infrastructure, configuration, monitoring, and upgrades. Running your own data warehouse (ClickHouse, Apache Druid) requires the same. Open-source is the right choice when you have the engineering capacity to operate it and when the customization requirements justify the work.
Managed services (Snowflake, Fivetran, Databricks, Astronomer for Airflow) abstract away operational complexity. You pay a subscription fee and the vendor handles infrastructure, upgrades, and availability. Managed services are the right choice when time-to-value matters more than cost optimization, when your team is small, or when the operational burden of self-hosting would distract from building.
The tradeoffs are not permanent. Many companies start with managed services to move fast, then migrate selected components to open-source as scale and cost justify the investment.
In this course we focus on the Python-level skills that are portable across any stack configuration. Whether your company uses Airflow or Prefect, Snowflake or BigQuery, the pipeline logic you write in Python will be fundamentally the same.
Part 4: How the Layers Connect
To make the architecture concrete, trace a single piece of data from its origin to a business dashboard.
The journey of a purchase event:
- A customer clicks "Buy" on the company's website. The application records this in the orders table in a PostgreSQL database and sends a purchase_completed event to the event tracking system.
- A Fivetran connector is scheduled to run every hour. It reads new rows from the PostgreSQL orders table since the last sync and loads them into the raw schema of the Snowflake data warehouse. Simultaneously, the event tracking system streams the purchase_completed event into the data lake in S3 as a JSON file.
- An Airflow DAG runs at 6 AM. One task downloads the new JSON files from S3 and loads them into Snowflake's raw schema. Another task runs dbt, which applies transformations: deduplicating orders, joining with the customer table, calculating order revenue, and populating the fct_orders table in the mart schema.
- The data quality check step in the dbt run asserts that no order has a null customer ID and that today's total revenue is within 20% of yesterday's. The checks pass.
- The business analyst opens the Looker dashboard at 8 AM. Looker queries the fct_orders table in Snowflake. The dashboard shows today's revenue, broken down by product category, updated as of the last dbt run.
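The revenue tolerance check from the walkthrough can be sketched as a plain Python function; dbt would express the same idea as a SQL test, but the logic is identical. The 20% threshold is the one from the example above, not a universal standard.

```python
def revenue_within_tolerance(today, yesterday, tolerance=0.20):
    """Return True if today's revenue is within `tolerance` of yesterday's."""
    if yesterday == 0:
        # No baseline to compare against; flag for human review.
        return False
    return abs(today - yesterday) / yesterday <= tolerance

print(revenue_within_tolerance(11_500.0, 10_000.0))  # +15%: within tolerance
print(revenue_within_tolerance(15_000.0, 10_000.0))  # +50%: fails the check
```

Checks like this are deliberately crude: they do not prove the data is correct, only that it is not obviously broken, which is enough to stop a bad number from reaching the 8 AM dashboard.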
The entire journey — from customer click to analyst dashboard — was automated, monitored, and required no human intervention on a normal day. That is what the modern data stack enables.
Key Takeaways
- The modern data stack has four layers: Sources, Ingestion/Orchestration, Storage, and Transformation/Serving
- Sources (databases, APIs, files, event streams) exist outside the stack and cannot be controlled
- Ingestion moves data from sources to storage using managed connectors (Fivetran, Airbyte) or custom Python
- Orchestration (Airflow, Prefect) schedules and monitors pipeline execution
- Storage includes data warehouses (Snowflake, BigQuery) for structured analytical queries and data lakes (S3) for raw data
- Transformation (dbt, pandas) shapes raw data into reliable data models for consumption
- Open-source gives control at the cost of operational burden; managed services give simplicity at the cost of vendor dependency
Common Mistakes to Avoid
Over-engineering early. Companies at 20 employees do not need a Spark cluster and a lakehouse architecture. A data warehouse plus a few Python scripts plus a simple scheduler is sufficient for most early-stage data work. Add complexity only when a simpler solution demonstrably cannot meet requirements.
Choosing tools before understanding data. The right storage choice depends on query patterns. The right transformation approach depends on data volume and complexity. It is common for engineers to pick tools first and only later discover the choice does not fit the problem. Understand the problem first.
Assuming managed services handle everything. Fivetran handles extraction and loading. It does not handle transformation, data quality, or schema evolution in your warehouse. Every layer of the stack still requires engineering judgment.
Next Lesson Preview
In Lesson 3: Python for Data Engineering, you will learn why Python has become the dominant language for data engineering, which libraries you will use in this course, and how your existing Python knowledge maps to data engineering problems.
Back to Course Overview | Next Lesson: Python for Data Engineering →