Lesson 8: Data Engineering Roles and the Tooling Landscape

Course: Data Engineering | Duration: 35 minutes | Level: Intermediate

Learning Objectives

By the end of this lesson, you will be able to:

Describe the five main roles on a modern data team and what distinguishes each
Name the categories in the data engineering toolbox and the primary tool in each
Explain why this course teaches raw Python rather than starting with frameworks
Map your learning path through Sections 2–10 to specific job-ready skills

Prerequisites

All previous lessons in this section (Lessons 1–7)

Part 1: The Data Team Ecosystem

Modern data teams are made up of several distinct roles. These roles overlap significantly — especially at smaller companies, where individuals often span two or three roles — but the distinctions matter for career planning and for understanding who builds what.

Data Engineer

The data engineer builds and maintains the infrastructure that makes data usable. Their primary work is pipelines: extracting data from sources, transforming it into reliable data products, loading it into destinations, and keeping it all running reliably.

Data engineers own the data infrastructure: the orchestration system, the data warehouse, the object storage layer, the ingestion connectors. They define the standards for how data is structured, tested, and monitored. They are the first line of defense when a pipeline breaks.

Skills that are central to the data engineer role: Python, SQL, cloud infrastructure (AWS/GCP/Azure), data modeling, pipeline design, distributed systems concepts.

Analytics Engineer

The analytics engineer is a relatively new role, popularized by the rise of dbt. Analytics engineers sit between data engineering and data analysis: they take the raw or lightly processed data that data engineers deliver and build the clean, reliable, well-documented data models that analysts use in their daily work.

An analytics engineer writes dbt models — SQL transformations that define how raw data is shaped into fact and dimension tables. They define business logic (what is the official definition of "active user"?), document data models, and write dbt tests to validate data quality.

Analytics engineers typically have strong SQL skills and understand business logic deeply. They may have less infrastructure experience than data engineers and more analytical domain knowledge.

Data Analyst

The data analyst answers business questions using data. They write SQL queries, build dashboards and reports, interpret results, and communicate findings to non-technical stakeholders. They are data consumers, not data producers — their work depends on data that data engineers and analytics engineers have made available.

Analysts need strong SQL skills, understanding of visualization tools (Looker, Tableau, Metabase), and the ability to translate ambiguous business questions into specific, answerable data questions. Python is increasingly useful for analysts (pandas for data exploration, matplotlib for visualization) but is not always required.

Data Scientist

The data scientist applies statistical methods and machine learning to extract predictions and deeper patterns from data. They build predictive models, run experiments (A/B tests), and develop recommendation systems, churn models, fraud models, and other ML-driven features.

Data scientists need Python and/or R, statistical knowledge, ML library experience (scikit-learn, PyTorch, TensorFlow), and understanding of experimentation design. They are heavy consumers of clean data and frequently frustrated when the data engineering layer is unreliable.

MLOps Engineer

The MLOps engineer is the infrastructure specialist for machine learning. They build the systems that train, deploy, monitor, and retrain ML models in production. This role is adjacent to data engineering but focused on the ML lifecycle rather than the data pipeline.

MLOps engineers work with ML training infrastructure (Kubernetes, GPU clusters), model serving systems, feature stores, and model monitoring. They bridge software engineering, DevOps, and ML.

The overlapping responsibilities reality:

At a startup with five engineers, one person might do data engineering, analytics engineering, and data analysis. At a large company with 50-person data teams, these roles are distinct career tracks. Understanding the distinctions helps you communicate clearly about scope and helps you identify which specialization you want to develop.

Part 2: The Data Engineering Toolbox

A full-stack data engineer needs working familiarity with tools in each of these categories. You do not need to master all of them simultaneously — mastery comes with practice — but you need to understand the landscape.

Orchestration

Orchestration tools schedule, run, and monitor pipeline workflows.

Apache Airflow is the dominant orchestration tool in the industry. Pipelines are defined as Python DAGs (Directed Acyclic Graphs). Airflow provides a web UI for monitoring, supports complex dependency graphs, and integrates with nearly every cloud service. Its operational complexity is high but its ubiquity in job postings makes it essentially required knowledge.

Prefect is a modern Python-native orchestration framework that addresses many of Airflow's operational pain points. Pipelines are ordinary Python functions decorated with Prefect primitives. Prefect Cloud provides managed infrastructure.

Dagster takes a data-asset-centric approach: you define the data assets your pipeline produces and Dagster infers the execution graph. Dagster has strong built-in support for data quality checks and asset lineage.

Ingestion

Ingestion tools move data from sources to storage.

Fivetran is the leading managed connector service. It provides pre-built, maintained connectors for hundreds of data sources. You configure the source credentials and destination — Fivetran handles extraction, schema mapping, and incremental sync. Cost is based on monthly active rows synced.

Airbyte is an open-source alternative to Fivetran with a large connector library. You can self-host Airbyte to avoid per-row pricing, at the cost of operational overhead. Airbyte Cloud is the managed option.

Custom Python scripts remain essential when managed connectors do not support a source, when the connector's behavior does not meet requirements, or when cost does not justify a managed service. This is where the skills in this course apply most directly.

Transformation

Transformation tools clean, model, and shape data into analytics-ready form.

dbt (data build tool) is the dominant transformation tool. Engineers write SQL SELECT statements that define how to transform raw data. dbt handles running the transformations in the correct dependency order, testing outputs, and generating documentation. dbt operates inside the data warehouse using its compute engine.

pandas handles transformation logic that is too complex for SQL or that must run outside the warehouse (Python-based ETL). Pandas is the workhorse for in-memory data manipulation in custom pipelines.

Apache Spark handles transformations on data volumes that exceed single-machine memory. PySpark is the Python API for Spark. Spark is used at companies with very large data volumes (terabytes to petabytes).

Storage

Storage is covered in depth in Lesson 6. The primary tools are:

Snowflake, Google BigQuery, Amazon Redshift for structured analytical data in the data warehouse layer.

AWS S3, Google Cloud Storage, Azure Data Lake for raw data in the data lake layer.

Delta Lake, Apache Iceberg for the Lakehouse pattern on top of object storage.

Data Quality

Data quality tools validate pipeline outputs and monitor data health.

Great Expectations is the leading Python framework for data quality assertions. You define "expectations" (the equivalent of unit tests for data) and run them against DataFrames or database tables.

dbt tests are SQL-based quality checks integrated into the dbt transformation layer: not-null tests, uniqueness tests, accepted-values tests, relationship integrity tests.

Monte Carlo, Bigeye are commercial data observability platforms that monitor warehouse tables for anomalies (unusual null rates, unexpected row count changes, schema drift) automatically.

Serving

Serving tools make clean data available to consumers.

Looker, Tableau, Metabase, Mode are BI (Business Intelligence) tools that connect to data warehouses and allow analysts to build dashboards without writing SQL for every query.

REST APIs built with FastAPI or Flask serve data to application features that need database results at runtime.

You do not need to master every tool before your first data engineering job. Employers care most about: SQL fluency, Python/pandas for data manipulation, one cloud platform (AWS, GCP, or Azure), and clear understanding of pipeline design. Tool-specific knowledge can be learned in weeks — foundational skills take months to develop.

Part 3: What This Course Covers vs What It Doesn't

This course deliberately starts with raw Python rather than with frameworks like Airflow or dbt. Understanding why this is the right pedagogical choice also helps you understand what you are building toward.

Why raw Python first:

Airflow is a framework for orchestrating Python code. If you do not understand how to write correct, testable Python pipeline logic, you will not be able to write good Airflow DAGs — you will only be able to copy examples from documentation without understanding why they work.

dbt is a framework for running SQL transformations. If you do not understand the transformation patterns — deduplication, slowly changing dimensions, aggregation, window functions — you will not be able to write good dbt models.

Great Expectations validates data against assertions. If you do not understand what makes data invalid and how to express quality criteria programmatically, you will not be able to write meaningful expectations.

Frameworks encode best practices developed by experienced engineers. Learning the frameworks before learning the underlying practices means learning the answers without understanding the questions.

This course teaches the underlying practices using Python directly. By Section 7, you will have built complete pipelines in raw Python. When you then learn Airflow (in your next role or through dedicated study), you will understand immediately what problem it solves because you will have already solved that problem manually.

What this course covers:

Pandas fundamentals for data manipulation (Section 2)
Data ingestion from APIs and files (Section 3)
SQL and database connections with SQLAlchemy (Section 4)
Transformation patterns: joins, aggregation, reshaping (Section 5)
File formats and serialization: CSV, JSON, Parquet (Section 6)
Building complete end-to-end pipelines with proper error handling and logging (Section 7)
Data quality testing with Great Expectations and unit tests (Section 8)
Orchestration concepts and patterns (without a live cluster) (Section 9)
Capstone project: a complete production-quality pipeline (Section 10)

What this course does not cover (and why):

This course does not run an Airflow cluster, a Snowflake warehouse, or a Kafka stream. These are infrastructure setups that require cloud accounts, billing, and operational management that would distract from learning fundamentals. They also change frequently as hosted services evolve.

The pipeline logic you build in this course — the Python functions, the transformation patterns, the quality checks — is directly transferable to any orchestration or storage tool. The fundamentals do not change when the infrastructure changes.

Part 4: Your Learning Path from Here

Completing this section gives you the conceptual vocabulary and mental model. Sections 2–10 build the hands-on skills systematically, each section building on the previous.

Section 2 (Pandas Fundamentals): You start writing code. By the end of Section 2, you can load data from a CSV or database, inspect and explore a DataFrame, filter and select data, handle missing values, and apply functions. This is the foundation that everything else builds on.

Section 3 (Data Ingestion): You build your first real extractors — Python scripts that call REST APIs, handle pagination, manage authentication, and save results to files. This is the "Extract" in ETL/ELT.

Section 4 (SQL and Databases): You connect Python to SQL databases using SQLAlchemy, execute queries from code, and load results into DataFrames. This covers both extraction from operational databases and loading into analytical destinations.

Section 5 (Transformation Patterns): You master the pandas operations that appear most frequently in real pipeline work: joins, aggregations, reshaping, window operations, handling real-world messy data.

Section 6 (File Formats): You learn to read and write CSV, JSON, and Parquet files efficiently. By the end, you understand why Parquet is the standard for analytical workloads.

Section 7 (Complete Pipelines): You assemble the pieces — extraction, transformation, loading — into complete, production-quality pipelines with logging, error handling, retry logic, and clean code structure.

Section 8 (Data Quality): You add quality infrastructure to your pipelines: Great Expectations assertions, unit tests for transformation functions, and monitoring patterns.

Section 9 (Orchestration Concepts): You learn the DAG pattern, dependency management, and scheduling without operating infrastructure, preparing you to learn Airflow on the job.

Section 10 (Capstone): You build a complete pipeline from scratch that applies every skill from the course to a realistic data engineering problem.

By the time you finish this course, you will have the foundational skills to contribute meaningfully to a data engineering team on day one of your first role in the field.

Key Takeaways

The five main data team roles: Data Engineer (pipelines and infrastructure), Analytics Engineer (data models and dbt), Data Analyst (business queries and dashboards), Data Scientist (ML and statistics), MLOps Engineer (ML infrastructure)
The data engineering toolbox spans six categories: Orchestration (Airflow, Prefect, Dagster), Ingestion (Fivetran, Airbyte, custom Python), Transformation (dbt, pandas, Spark), Storage (Snowflake, BigQuery, S3), Quality (Great Expectations, dbt tests), Serving (BI tools, APIs)
This course teaches raw Python first because framework knowledge is shallow without understanding the underlying practices
The three skills employers value most for entry-level DE roles: SQL, Python/pandas, and understanding of pipeline design
Sections 2–10 build hands-on skills systematically — each section is a building block for the next

Common Mistakes to Avoid

Tool obsession. New data engineers often spend excessive time debating or learning tools before they have built anything. The tool landscape changes constantly — the underlying skills (writing clean pipeline code, designing for idempotency, ensuring data quality) do not.

Ignoring SQL. Many aspiring data engineers focus on Python and underestimate SQL. In practice, most data warehouse interaction happens in SQL, most dbt transformations are SQL, and most data analysis interviews include SQL questions. Python and SQL are equally important.

Rushing to frameworks before mastering fundamentals. Adding Airflow to a pipeline with bad logic does not make the logic better — it just makes the bad logic run on a schedule. Master the fundamentals in this course before adding framework complexity.

Back to Course Overview | Next Section: Pandas Fundamentals →