Lesson 10: Next Steps and Resources

Course: Data Engineering | Duration: 2 hours | Level: Advanced

Learning Objectives

By the end of this lesson, you will be able to:

Identify 3 tools to learn next based on your career direction
Understand where this course fits in the full data engineering learning path
Describe what a portfolio project looks like for a junior DE role
Plan your next 3 months of learning with a concrete 30/60/90-day schedule

Prerequisites

Sections 1–10 of the Data Engineering course (all sections complete)

Lesson Outline

Part 1: Tools to Learn Next

This course gave you the Python + pandas foundation. The tools below build on that foundation for different career tracks.

Analytics Engineering Track

Tool	What It Does	Prerequisite From This Course	Learn At
dbt	SQL-based transformation layer with testing, documentation, and dependency management	S7: SQL & Databases (you understand SQL; dbt runs SQL models)	getdbt.com/learn
BigQuery	Google's serverless data warehouse — SQL on petabytes with columnar storage	S7: pd.read_sql and SQL fundamentals	cloud.google.com/bigquery/docs
Looker / Looker Studio	BI layer for visualization and dashboards connected to BigQuery/dbt models	No specific prerequisite — builds on analytics outputs you produce	lookerstudio.google.com

Analytics engineering is the fastest-growing DE sub-discipline. The core loop is: raw data → dbt transform → BigQuery table → Looker dashboard. Your pandas skills transfer directly — dbt models are SQL but the design patterns (staging → intermediate → mart layers) mirror the ETL patterns from S6.

Data Infrastructure Track

Tool	What It Does	Prerequisite From This Course	Learn At
Apache Airflow	Workflow orchestration — schedule and monitor DAGs (directed acyclic graphs) of pipeline tasks	S6: ETL Pipelines (your `run_pipeline()` becomes a DAG operator)	airflow.apache.org/docs
Prefect	Modern Python-native orchestration — simpler than Airflow for Python-heavy teams	S6: ETL Pipelines	docs.prefect.io
Dagster	Asset-based orchestration — pipelines produce "assets" with lineage tracking	S6 + S8: ETL + data quality (Dagster has built-in asset quality checks)	docs.dagster.io

Your PipelineResult dataclass pattern from S6 maps directly to Airflow task return values. Learning Airflow after this course requires mainly: understanding DAG syntax, XCom for passing data between tasks, and connection objects for databases.

Streaming Data Track

Tool	What It Does	Prerequisite From This Course	Learn At
Apache Kafka	Distributed event streaming — publish/subscribe for real-time data	S3: understanding data formats (Kafka messages are often JSON or Avro)	kafka.apache.org/quickstart
Spark Structured Streaming	Real-time version of PySpark — process infinite streams as micro-batches	S5: groupby and aggregation (same concepts, different API)	spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Streaming adds the time dimension to data — instead of processing a batch file, you process an infinite stream. The transformation patterns (filter, aggregate, join) are identical to what you learned in S5; only the execution model changes.

Machine Learning Pipelines Track

Tool	What It Does	Prerequisite From This Course	Learn At
MLflow	Experiment tracking — log parameters, metrics, and model artifacts per run	S6: pipeline logging patterns (MLflow is structured logging for ML runs)	mlflow.org/docs
Feature Store concepts	A data store for pre-computed ML features shared across models	S5 + S9: transformation and performance (feature pipelines need both correctness and speed)	docs.hopsworks.ai (Hopsworks is open source)

Which Track Is Right for You?

If you want to...	Start with
Build dashboards and reports for business users	dbt → BigQuery → Looker
Build and operate data infrastructure	Airflow → cloud platform (AWS or GCP)
Work with real-time data products	Kafka → Spark Streaming
Work on ML teams as a data specialist	MLflow → cloud ML platform

Part 2: Learning Path Context

This course is one step in a longer journey. Here is where it fits:

code

Stage 1: Python Foundation
  Python Course (Sections 1–12)
  └─ Functions, classes, file I/O, error handling, testing with pytest
          |
          v
Stage 2: Data Engineering Fundamentals  ← YOU ARE HERE
  This Course (Sections 1–10)
  └─ pandas, ETL, SQL, data quality, performance optimization
          |
          v
Stage 3: Cloud Platforms
  AWS Data Engineering or Google Cloud Professional DE
  └─ S3/GCS for storage, Redshift/BigQuery for warehousing, Lambda/Cloud Functions for compute
          |
          v
Stage 4: Orchestration
  Apache Airflow or Prefect
  └─ Schedule pipelines, manage dependencies, handle failures automatically
          |
          v
Stage 5: Distributed Processing
  Apache Spark (PySpark)
  └─ Process datasets too large for a single machine (100GB+ files)
          |
          v
Stage 6: Specialization
  Streaming (Kafka + Flink) OR Analytics Engineering (dbt + Looker)
  OR ML Pipelines (MLflow + Feature Store)

You do not need to complete all stages before being employable. A junior DE role typically requires: Stages 1–3 (Python + pandas + one cloud platform). A mid-level role requires: Stages 1–4 (adding Airflow). Senior roles require Stages 1–5+.

Part 3: Portfolio Project Ideas

A GitHub portfolio with 3 working pipelines is more valuable than any certification. Each project should have: a clear README explaining the problem, a working pipeline with tests, and a sample output file showing what it produces.

Project 1: Daily Weather ETL

What you build: Fetch daily weather data from the NOAA public API, clean the JSON response into a pandas DataFrame, validate it against a schema, append to a SQLite database, and generate a daily summary report.

Data source: NOAA Climate Data API (api.weather.gov) — free, no API key required for many endpoints

What to show in README: "Fetches 30 days of weather data for 5 cities, cleans 3 common data quality issues, stores in SQLite, and generates a trend summary. Run time: 8 seconds."

Skills demonstrated: S3 (JSON loading), S4 (cleaning), S6 (ETL structure), S7 (SQLite), S8 (validation)

Project 2: Stock Price Analyzer

What you build: Download historical stock prices from Yahoo Finance (yfinance library), compute moving averages, detect golden/death cross signals, validate data continuity (no missing trading days), and produce a formatted analysis report.

Data source: yfinance Python library — free, no API key

What to show in README: "Analyzes 1 year of price data for 10 stocks. Detects 3 signal types. Validates for data gaps. Output: text report + cleaned CSV."

Skills demonstrated: S5 (rolling window aggregation via .rolling().mean()), S8 (data continuity validation), S9 (vectorized signal detection)

Project 3: E-Commerce Log Pipeline

What you build: Download a public e-commerce event log dataset (Kaggle has several), parse event types, compute conversion funnel metrics (view → cart → purchase rates), detect high-traffic sessions, and generate a business summary.

Data source: Kaggle eCommerce Events History dataset (free download, ~100MB CSV)

What to show in README: "Processes 1M+ events. Computes conversion funnel for 5 event types. Outputs session-level stats and product performance ranking."

Skills demonstrated: S3 (large file loading with chunksize), S5 (groupby aggregation), S9 (chunked processing for memory efficiency)

Project 4: GitHub Activity Dashboard

What you build: Fetch repository statistics from the GitHub API (stars, forks, issues, commit frequency over time), store in SQLite, and produce a weekly trend report comparing 10 open-source projects.

Data source: GitHub REST API — free for public repos, 60 requests/hour unauthenticated, 5000 with token

What to show in README: "Tracks 10 repos over 52 weeks. Detects growth trends. Stores history locally. Generates weekly comparison report."

Skills demonstrated: S6 (ETL with error recovery for rate limits), S7 (SQLite for historical storage), S5 (time-based groupby)

Project 5: CSV Data Quality Auditor

What you build: A CLI tool that accepts any CSV file, runs the 4-dimension quality check from Project 6 of this course, and outputs a structured quality report. Add a --threshold flag for configurable PASS/WARN/FAIL levels.

Data source: User-supplied (the tool works on any CSV)

What to show in README: "Pass any CSV, get a quality report. Checks completeness, uniqueness, validity, and consistency. Configurable thresholds. Used on 3 public datasets in the examples/ directory."

Skills demonstrated: S8 (all quality dimension patterns), S6 (CLI pipeline structure), S3 (reading arbitrary CSVs with dtype inference)

Part 4: 30/60/90-Day Learning Plan

Use this plan immediately after completing the course. The goal: consolidate what you learned, ship one portfolio project, and start the next learning stage.

Month 1 (Days 1–30): Consolidate

Week 1–2: Recall practice

Redo all 7 projects from memory without looking at solutions
For each project: write the pipeline from a blank file, then compare to your solution
Document each function you had to look up — those are your weak spots

Week 3–4: Extend one project

Pick the project that produced the output most relevant to a job you want
Add the Extension Challenges you skipped
Add a full pytest test suite (3+ tests per function)
Write a proper README

Goal by Day 30: 1 portfolio project fully tested and documented on GitHub.

Month 2 (Days 31–60): Build and Ship

Week 5–6: Choose and plan a new project

Pick one of the 5 portfolio project ideas above (or your own)
Write the README before writing any code (define the deliverable first)
Design the architecture (ASCII diagram showing input → steps → output)

Week 7–8: Build and iterate

Implement the pipeline with tests
Handle real data quality issues (they will be different from the course datasets)
Document every design decision in the README

Goal by Day 60: A second portfolio project on GitHub with real data, working pipeline, and pytest suite.

Month 3 (Days 61–90): Next Stage

Week 9–10: Cloud platform basics

Create a free tier account on AWS or GCP
Complete one official quickstart: AWS S3 + Lambda, or GCP Cloud Storage + Cloud Functions
Migrate one of your portfolio pipelines to run on the cloud (trigger on file upload)

Week 11–12: Orchestration introduction

Install Airflow locally (Docker Desktop + official quickstart)
Convert your daily weather or stock pipeline into an Airflow DAG
Schedule it to run daily and verify it runs automatically

Goal by Day 90: One pipeline running on a cloud platform, scheduled automatically. You are now in Stage 3–4 of the learning path.

You've Completed the Data Engineering Course

You have worked through 10 sections covering the complete fundamentals of data engineering with Python and pandas: from loading raw files and cleaning messy data, through building ETL pipelines and SQL-backed storage, to enforcing data contracts and optimizing for performance. The 7 capstone projects gave you end-to-end implementations that demonstrate every major pattern.

The skills you now have — pandas, ETL design, data quality, SQL, and performance optimization — are the minimum viable toolkit for a data engineering role. They are also the foundation that every advanced tool (Airflow, Spark, dbt, Kafka) is built on top of.

Ship something. The gap between "I learned data engineering" and "I work in data engineering" is a GitHub repository with 3 working pipelines.

Key Takeaways

Skills compound: pandas + SQL + testing is the minimum viable DE toolkit; everything else (Airflow, Spark, dbt) is built on top of it
Public datasets are your practice material: NOAA, data.gov, Kaggle, SEC EDGAR, GitHub API — all free, all real, all have messy data quality issues
A GitHub portfolio with 3 working pipelines is more valuable than certifications: hiring managers run your code; they don't read certificate PDFs
Learn the next tool when you have a project that needs it: don't learn Airflow in the abstract — migrate a pipeline you care about
Teach what you build: write about your projects (GitHub README, blog post, LinkedIn post) — explaining forces clarity and creates networking opportunities

Common Mistakes to Avoid

Learning without building: reading pandas documentation and watching tutorials creates familiarity but not skill. Skill comes from debugging real data problems. Every hour spent on a real dataset is worth 3 hours of tutorial time.
Picking too complex a first portfolio project: the goal is to ship something. A CSV → cleaned CSV → SQLite pipeline with a quality report is a legitimate portfolio project. It demonstrates profiling, cleaning, validation, and SQL — exactly what a DE hiring manager wants to see. You do not need Spark or Kafka for your first project.
Skipping testing: untested pipelines fail silently in production. A pipeline that produces wrong output without raising an error is harder to debug than one that raises immediately. Three tests per function is the minimum — test the happy path, one edge case (empty DataFrame, single row), and one error case (missing column, null in required field).
Optimizing before shipping: don't reach for chunked processing or vectorization until the pipeline produces correct output. Correct and slow is debuggable. Fast and wrong is a production incident.

You've Completed the Data Engineering Course

Ten sections. Seven projects. One complete curriculum.

You now have the foundation to work with real data, build reliable pipelines, and continue learning on your own. The next step is yours to take — pick a project, write the README, and ship it.

← Lesson 9: Course Review and Patterns | ← Back to Section 10 Overview

Lesson 10: Next Steps and Resources

Learning Objectives

Prerequisites

Lesson Outline

Part 1: Tools to Learn Next

Analytics Engineering Track

Data Infrastructure Track

Streaming Data Track

Machine Learning Pipelines Track

Which Track Is Right for You?

Part 2: Learning Path Context

Part 3: Portfolio Project Ideas

Project 1: Daily Weather ETL

Project 2: Stock Price Analyzer

Project 3: E-Commerce Log Pipeline

Project 4: GitHub Activity Dashboard

Project 5: CSV Data Quality Auditor

Part 4: 30/60/90-Day Learning Plan

Month 1 (Days 1–30): Consolidate

Month 2 (Days 31–60): Build and Ship

Month 3 (Days 61–90): Next Stage

You've Completed the Data Engineering Course

Key Takeaways

Common Mistakes to Avoid

You've Completed the Data Engineering Course

Concept Map

Try it yourself