Lesson 1: What is Data Engineering?

Course: Data Engineering | Duration: 30 minutes | Level: Intermediate

Learning Objectives

By the end of this lesson, you will be able to:

Define data engineering in your own words
Explain why data engineering exists as a discipline separate from data science and analytics
Describe what data engineers actually do day-to-day
Distinguish between data engineering, data science, and data analytics roles

Prerequisites

Completion of the Python Fundamentals course (or equivalent Python knowledge)
Familiarity with the concept of files, databases, and APIs at a high level

Part 1: The Data Problem Every Company Faces

Most companies are drowning in raw data and starving for usable information.

Consider a mid-sized e-commerce company. Every day, their systems generate millions of events: products viewed, items added to carts, purchases completed, support tickets opened, inventory updated, emails sent, deliveries tracked. All of this data lands in dozens of different systems — a transactional database for orders, a separate system for inventory, a third-party email platform, a shipping provider's API, web analytics logs stored as raw JSON files on cloud storage.

The business questions are straightforward: Which products are trending? Which customers are most valuable? Where in the checkout flow do users abandon? Is our inventory going to last through the holiday season?

But the data that answers those questions is not sitting in one place, clean and ready to query. It is scattered, inconsistently formatted, partially duplicated, delayed, and often corrupted by edge cases that nobody anticipated when the systems were built. The marketing team has a different definition of "active customer" than the finance team. The data from the shipping provider arrives 24 hours late. The inventory system uses product IDs that don't match the IDs used in the order database.

This is the fundamental data problem: raw data exists everywhere, but the infrastructure to transform it into reliable, consistent, queryable information does not exist by default. Someone has to build that infrastructure. That someone is a data engineer.

Part 2: What Data Engineering Is

Data engineering is the discipline of designing, building, and maintaining the systems that collect, store, process, and deliver data reliably at scale.

The key word is systems. Data engineers are not analysts running ad-hoc queries. They are not data scientists building models. They are infrastructure engineers — but their infrastructure is made of data pipelines, storage layers, and processing systems rather than servers and networks.

A complete, working definition: a data engineer builds the plumbing that turns raw data into reliable data products that other people can trust and use.

The roads and cars analogy is one of the most useful mental models for this field:

Data engineers build the roads, bridges, and highways
Data scientists drive the cars to their destinations (insights, models, predictions)
Data analysts navigate and report on traffic patterns
The roads have to exist, be maintained, and handle the load — before any of the cars can go anywhere

Without the roads, the most sophisticated driving is impossible. Companies that invest heavily in data science but neglect data engineering consistently find that their data scientists spend 80% of their time cleaning data instead of doing science.

What data engineering is not:

Data engineering is sometimes confused with related roles. The distinction matters:

Data science is concerned with extracting insight from data through statistical analysis, machine learning, and experimentation. Data scientists build models. Data engineers build the systems that feed those models with clean data.
Data analytics is concerned with querying and visualizing data to answer business questions. Analysts consume clean, structured data. Data engineers produce that clean, structured data.
Database administration (DBA) is concerned with operating and maintaining production databases. DBAs keep the lights on. Data engineers build pipelines that move data between those databases and other systems.
Software engineering builds the products and applications that generate data as a byproduct. Data engineers collect and process the data those applications generate.

There is genuine overlap between these roles, especially at smaller companies. But the core identity of data engineering is infrastructure for data.

Part 3: What Data Engineers Actually Do

The day-to-day work of a data engineer falls into a few recurring categories.

Designing and building data pipelines is the most fundamental activity. A pipeline is an automated workflow that extracts data from a source, optionally transforms it, and loads it into a destination. Data engineers design these pipelines to be reliable, scalable, and maintainable. In practice this means writing Python (or SQL, or Spark) code that runs on a schedule and handles all the edge cases: what happens when the source API is down? What happens when the schema changes? What happens when a record is duplicated?

Managing data infrastructure means owning the operational layer: the cloud storage buckets where raw data lands, the data warehouse where transformed data lives, the orchestration system that schedules and monitors pipeline runs, the alerting system that pages someone when a pipeline fails. This is closer to DevOps than to traditional software development.

Ensuring data availability and quality means that the data products a data engineer delivers must be trustworthy. If the sales dashboard shows yesterday's revenue as zero because the pipeline silently failed, business decisions get made on wrong information. Data engineers build monitoring, alerting, and data quality checks into their pipelines so that failures are detected and reported before downstream users notice.

Adding new data sources is a constant demand. The business wants to analyze a new SaaS tool's data. Marketing signed a contract with a new data vendor. The mobile app team added new event tracking. Data engineers assess these sources, build connectors, and integrate the new data into the existing warehouse in a way that is consistent and queryable.

Collaborating with data consumers is unavoidable. Data engineers work with analysts, data scientists, and product managers to understand what data they need, in what form, with what freshness. The pipeline is not done when it runs — it is done when the consumer can actually use it.

Part 4: A Day in the Life

To make this concrete, here is what a typical working day might look like for a data engineer at a mid-sized company.

9:00 AM — Pipeline alert. An automated alert fires: the daily customer activity pipeline failed overnight. The engineer opens the monitoring dashboard, finds the error log, and sees that the source API returned a 429 (rate limit) error at 2 AM. The fix is straightforward: add exponential backoff retry logic to the extractor. The engineer makes the fix, re-runs the pipeline manually, confirms it succeeds, and deploys the change.

10:30 AM — New data source request. A product manager asks for Salesforce CRM data to be available in the data warehouse so that the sales team can join it with product usage data. The engineer reviews the Salesforce API docs, estimates the work, identifies which objects are needed (Accounts, Opportunities, Contacts), and starts building the extractor. This will take two to three days.

1:00 PM — Schema change incident. The analytics team reports that a dashboard is broken. Investigating, the engineer discovers that the upstream application team changed a column name in their production database two days ago without telling anyone. The pipeline faithfully ingested the new schema, but downstream transformations that expected the old column name are now failing silently and producing nulls. The fix requires updating the transformation, adding a schema validation check, and creating a communication process so that application teams notify data engineering before making schema changes.

3:00 PM — Performance optimization. A weekly report that used to run in 2 minutes now takes 45 minutes. The engineer profiles the query, identifies a missing index on a join column in the staging table, and benchmarks the change. After adding the index, the query runs in 90 seconds.

4:30 PM — Documentation and handoff. The engineer documents the Salesforce connector design in the team wiki, updates the pipeline runbook, and creates a pull request for code review.

This is ordinary data engineering work: a mixture of debugging, building, and maintaining. The problems are real, the impact is immediate, and the craft is genuinely interesting.

Key Takeaways

Data engineering exists because raw data is unusable without infrastructure to collect, clean, and organize it
Data engineers build pipelines, manage data infrastructure, and ensure data quality — they do not typically run statistical analyses or build ML models
The roads-and-cars analogy: DE builds the roads; data science and analytics drive on them
Core activities: building pipelines, managing storage, ensuring reliability, adding new sources, collaborating with consumers
Data engineering is closer to software/infrastructure engineering than to analytics or statistics

Common Mistakes to Avoid

Confusing data engineering with data science. Many people new to the field assume that data engineering is a stepping stone toward data science, or that it is a lesser version of it. In reality, data engineering is a distinct discipline with its own depth, career path, and skill set. Neither is more important — they are complementary.

Thinking data engineering is only for big companies. The problems data engineering solves — reliable pipelines, consistent data, trustworthy dashboards — appear at companies with 10 employees and at companies with 100,000. The tools and scale differ, but the discipline applies everywhere data is used to make decisions.

Assuming the job is mostly coding. Writing code is part of the job, but communication, system design, and operational discipline are equally important. A brilliant pipeline that nobody trusts because it has no monitoring is worse than a simple pipeline with solid alerting.

Next Lesson Preview

In Lesson 2: The Modern Data Stack, you will learn about the layered architecture that most data teams use today — from the sources where data originates to the BI tools where it is finally consumed. You will see where each layer fits, which tools dominate each layer, and why the stack looks the way it does.

Back to Course Overview | Next Lesson: The Modern Data Stack →