JustLearn
Data Engineering
Intermediate2 hours

Lesson 1: SQL and Databases — Why Data Engineers Use Them

Course: SQL & Databases | Duration: 2 hours | Level: Intermediate

Learning Objectives

By the end of this lesson, you will be able to:

  • Understand when to use a database vs a CSV file
  • Identify the role of SQL in data engineering pipelines
  • Recognize SQLite as an embedded database for pipelines and prototyping
  • Map database concepts to pandas concepts (table → DataFrame, row → row, column → column)

Prerequisites

  • Section 2: Pandas Fundamentals
  • Section 6: ETL Pipelines (recommended)

Lesson Outline

Part 1: Databases vs Files (30 minutes)

When Files Are Enough

For many data engineering tasks, flat files (CSV, Parquet, JSON) work perfectly well:

  • Small to medium data: files that fit in memory (< 10GB) process efficiently with pandas
  • Batch reads: you read the entire dataset once, transform it, write it out
  • Simple pipelines: one writer, no concurrent access, no complex relationships
python
import pandas as pd
 
# This works fine for files that fit in memory
df = pd.read_csv("sales_2024.csv")
filtered = df[df["region"] == "North"]
print(f"North region rows: {len(filtered)}")

When You Need a Database

Files break down in four scenarios:

1. Concurrent writes: if multiple processes write to the same CSV simultaneously, data gets corrupted. Databases handle concurrent writes with locking and transactions.

2. Complex queries on large data: reading a 1M row CSV to filter 10 rows loads all 1M rows into memory first. A database executes the filter server-side and only returns the 10 matching rows.

python
# Without a database: load everything, then filter
# This reads ALL 1M rows into memory just to get 10
df = pd.read_csv("transactions.csv")  # 1M rows, 500MB
result = df[df["transaction_id"] == 12345]  # 10 rows
 
# With a database: filter happens in the DB, only 10 rows sent over the wire
# result = pd.read_sql("SELECT * FROM transactions WHERE transaction_id = 12345", conn)

3. Referential integrity: databases enforce relationships — you cannot add an order for a customer that does not exist. Files have no enforcement mechanism.

4. Data that does not fit in memory: a 500GB dataset can live in a database and be queried efficiently; it cannot be loaded into a pandas DataFrame.

Types of Databases

Data engineers work with all three categories:

CategoryExamplesUse Case
Relational (RDBMS)SQLite, PostgreSQL, MySQLTransactional data, normalized schemas, ACID guarantees
Column-store (OLAP)BigQuery, Snowflake, RedshiftAnalytical queries on billions of rows, aggregations
NoSQLMongoDB, Redis, CassandraFlexible schemas, key-value lookups, document storage

This section focuses on relational databases using SQLite — the most portable option, with no server required.

Part 2: SQL for Data Engineers (30 minutes)

What SQL Is

SQL (Structured Query Language) is the standard language for working with relational databases. It runs inside the database engine — not in Python — which is what makes it powerful for large datasets.

Core SQL statement types:

StatementTypeData Engineering Use
SELECTDQL (Query)Extract data from tables
INSERTDML (Manipulation)Load records into tables
UPDATEDMLChange existing records
DELETEDMLRemove records
CREATE TABLEDDL (Definition)Create the schema
DROP TABLEDDLRemove a table

In practice, most pipeline work uses SELECT (extraction), INSERT/UPDATE (loading), and CREATE TABLE (schema management).

Entity Relationships

Relational databases store data in normalized tables connected by foreign keys. This eliminates duplication:

code
customers table         orders table              products table
-----------             -----------               -----------
customer_id (PK)  ←──  customer_id (FK)          product_id (PK)
name                    order_id (PK)         ←── product_id (FK)  in orders
email                   product_id (FK) ─────/    name
city                    quantity                   category
                        order_date                 unit_price
                        amount

A customer's name is stored once in customers. Orders reference that customer by customer_id. This is a one-to-many relationship: one customer can have many orders.

Part 3: Pandas ↔ SQL Mental Model (30 minutes)

Side-by-Side Comparison

The same operations exist in both SQL and pandas. Understanding the mapping makes you fluent in both:

OperationSQLpandas
Select columnsSELECT name, salary FROM employeesdf[["name", "salary"]]
Filter rowsSELECT * FROM sales WHERE region = 'North'df[df["region"] == "North"]
SortORDER BY salary DESCdf.sort_values("salary", ascending=False)
AggregateSELECT region, SUM(revenue) FROM sales GROUP BY regiondf.groupby("region")["revenue"].sum()
JoinSELECT * FROM orders JOIN customers ON orders.customer_id = customers.idpd.merge(orders, customers, on="customer_id")
Count rowsSELECT COUNT(*) FROM orderslen(df)
python
import pandas as pd
 
# Create sample data to demonstrate the mapping
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "product": ["A", "B", "A", "C", "A"],
    "revenue": [1200, 800, 1500, 600, 900],
})
 
# pandas groupby — same as: SELECT region, SUM(revenue) FROM sales GROUP BY region
result_pandas = sales.groupby("region")["revenue"].sum().reset_index()
result_pandas.columns = ["region", "total_revenue"]
print("pandas result:")
print(result_pandas)

When to Use SQL vs pandas

Use SQL when:

  • Data lives in a database and is too large to load entirely into memory
  • You need server-side aggregation (let the database do the heavy lifting)
  • Multiple concurrent pipelines read/write the same dataset

Use pandas when:

  • Data is already in memory or in files
  • You need complex Python logic (custom functions, ML features, string processing)
  • You need reshaping operations like pivot, melt, or stack

In practice, use both: SQL extracts and aggregates, pandas transforms and enriches.

Part 4: Hands-on Practice (30 minutes)

Exercise 1: SQL to pandas Translation

Translate these SQL statements to pandas code (using the sales DataFrame above):

python
import pandas as pd
 
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "product": ["A", "B", "A", "C", "A", "B"],
    "revenue": [1200, 800, 1500, 600, 900, 1100],
    "quantity": [10, 5, 12, 4, 7, 9],
})
 
# SQL: SELECT * FROM sales WHERE revenue > 1000
# Your pandas code here:
 
# SQL: SELECT region, COUNT(*) as row_count FROM sales GROUP BY region
# Your pandas code here:
 
# SQL: SELECT * FROM sales ORDER BY revenue DESC LIMIT 3
# Your pandas code here:

Exercise 2: pandas to SQL Translation

Translate these pandas operations to SQL:

python
# Given this pandas code, write the equivalent SQL:
 
# 1. df[df["product"] == "A"]["revenue"].mean()
# SQL equivalent: SELECT ??? FROM sales WHERE ???
 
# 2. df.groupby("region")["quantity"].sum().sort_values(ascending=False)
# SQL equivalent: SELECT ??? FROM sales GROUP BY ??? ORDER BY ???
 
# 3. df[df["revenue"] > df["revenue"].mean()]
# SQL equivalent: SELECT ??? FROM sales WHERE revenue > (SELECT ???)

Key Takeaways

  • Use a database when data is too large for memory, has concurrent writers, or requires referential integrity; use files for batch workloads that fit in memory
  • SQL is the query language for relational databases — it runs server-side, filtering and aggregating before results reach Python
  • SQLite is an embedded relational database (no server) — perfect for learning, testing, and lightweight pipelines
  • pandas and SQL concepts map directly: table = DataFrame, row = row, column = column; GROUP BY = groupby(), JOIN = pd.merge()
  • In real pipelines, use both: SQL extracts and aggregates at the database level, pandas handles Python-side transformation

Common Mistakes to Avoid

  • Loading entire tables into pandas when you only need a subset: always filter in SQL first with WHERE, then use pandas for downstream logic
  • Treating SQL and pandas as alternatives: they are complementary — SQL at the database layer, pandas in Python
  • Forgetting that SQL runs in the database: pd.read_sql("SELECT ...") sends the query to the DB engine; the DB does the work

← Section 6 | Back to Course | Next →

Concept Map

Try it yourself

Write Python code below and click Run to execute it in your browser.