
Lesson 10: Next Steps and Resources

Course: Data Engineering | Duration: 2 hours | Level: Advanced

Learning Objectives

By the end of this lesson, you will be able to:

  • Identify 3 tools to learn next based on your career direction
  • Understand where this course fits in the full data engineering learning path
  • Describe what a portfolio project looks like for a junior DE role
  • Plan your next 3 months of learning with a concrete 30/60/90-day schedule

Prerequisites

  • Sections 1–10 of the Data Engineering course (all sections complete)

Lesson Outline

Part 1: Tools to Learn Next

This course gave you the Python + pandas foundation. The tools below build on that foundation for different career tracks.

Analytics Engineering Track

| Tool | What It Does | Prerequisite From This Course | Learn At |
| --- | --- | --- | --- |
| dbt | SQL-based transformation layer with testing, documentation, and dependency management | S7: SQL & Databases (you understand SQL; dbt runs SQL models) | getdbt.com/learn |
| BigQuery | Google's serverless data warehouse — SQL on petabytes with columnar storage | S7: pd.read_sql and SQL fundamentals | cloud.google.com/bigquery/docs |
| Looker / Looker Studio | BI layer for visualization and dashboards connected to BigQuery/dbt models | No specific prerequisite — builds on analytics outputs you produce | lookerstudio.google.com |

Analytics engineering is the fastest-growing DE sub-discipline. The core loop is: raw data → dbt transform → BigQuery table → Looker dashboard. Your pandas skills transfer directly — dbt models are SQL but the design patterns (staging → intermediate → mart layers) mirror the ETL patterns from S6.
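The layering can be sketched in plain Python; in dbt each function below would be a SQL model, but the staging → intermediate → mart flow is the same. The rows and field names here are made up for illustration:

```python
# Staging -> intermediate -> mart, sketched in plain Python. In dbt each
# step would be a SQL model; the layering pattern is what transfers.
raw = [{"ORDER_ID": "1", "AMT": "10.5"}, {"ORDER_ID": "2", "AMT": "4.0"}]

def staging(rows):
    # Rename and cast only; no business logic in the staging layer.
    return [{"order_id": int(r["ORDER_ID"]), "amount": float(r["AMT"])} for r in rows]

def intermediate(rows):
    # Business rules live here, e.g. dropping zero-value test orders.
    return [r for r in rows if r["amount"] > 0]

def mart(rows):
    # The final aggregate a dashboard would read.
    return {"total_revenue": sum(r["amount"] for r in rows)}

print(mart(intermediate(staging(raw))))  # {'total_revenue': 14.5}
```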

Data Infrastructure Track

| Tool | What It Does | Prerequisite From This Course | Learn At |
| --- | --- | --- | --- |
| Apache Airflow | Workflow orchestration — schedule and monitor DAGs (directed acyclic graphs) of pipeline tasks | S6: ETL Pipelines (your run_pipeline() becomes a DAG operator) | airflow.apache.org/docs |
| Prefect | Modern Python-native orchestration — simpler than Airflow for Python-heavy teams | S6: ETL Pipelines | docs.prefect.io |
| Dagster | Asset-based orchestration — pipelines produce "assets" with lineage tracking | S6 + S8: ETL + data quality (Dagster has built-in asset quality checks) | docs.dagster.io |

Your PipelineResult dataclass pattern from S6 maps directly to Airflow task return values. After this course, learning Airflow mainly comes down to three things: DAG syntax, XCom for passing data between tasks, and connection objects for databases.
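A minimal sketch of that mapping, with no Airflow installed. The dataclass and task functions are illustrative stand-ins, not real Airflow API:

```python
from dataclasses import dataclass

# Sketch only: Airflow is not imported here. In a real DAG each function
# below becomes a task, and the returned dict is what XCom would carry.
@dataclass
class PipelineResult:
    rows_processed: int
    status: str

def extract_task() -> dict:
    result = PipelineResult(rows_processed=120, status="ok")
    # XCom serializes task return values, so hand over plain types, not objects.
    return {"rows_processed": result.rows_processed, "status": result.status}

def report_task(upstream: dict) -> str:
    return f"{upstream['rows_processed']} rows, status={upstream['status']}"

print(report_task(extract_task()))  # 120 rows, status=ok
```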

Streaming Data Track

| Tool | What It Does | Prerequisite From This Course | Learn At |
| --- | --- | --- | --- |
| Apache Kafka | Distributed event streaming — publish/subscribe for real-time data | S3: understanding data formats (Kafka messages are often JSON or Avro) | kafka.apache.org/quickstart |
| Spark Structured Streaming | Real-time version of PySpark — process infinite streams as micro-batches | S5: groupby and aggregation (same concepts, different API) | spark.apache.org/docs/latest/structured-streaming-programming-guide.html |

Streaming adds the time dimension to data — instead of processing a batch file, you process an infinite stream. The transformation patterns (filter, aggregate, join) are identical to what you learned in S5; only the execution model changes.
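A toy example makes the point: the same keyed sum, computed once over a full batch and again incrementally over micro-batches. Events and values are made up:

```python
# The same keyed sum, computed batch-style and stream-style.
events = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# Batch: all data is available at once (what a pandas groupby-sum does).
batch_totals = {}
for key, value in events:
    batch_totals[key] = batch_totals.get(key, 0) + value

# Streaming: identical logic, applied incrementally as micro-batches arrive.
stream_totals = {}
for micro_batch in (events[:2], events[2:]):
    for key, value in micro_batch:
        stream_totals[key] = stream_totals.get(key, 0) + value
    # After each micro-batch, the running state could be emitted downstream.

print(batch_totals == stream_totals)  # True
```

The transformation is the same; only when and how often the state is updated changes.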

Machine Learning Pipelines Track

| Tool | What It Does | Prerequisite From This Course | Learn At |
| --- | --- | --- | --- |
| MLflow | Experiment tracking — log parameters, metrics, and model artifacts per run | S6: pipeline logging patterns (MLflow is structured logging for ML runs) | mlflow.org/docs |
| Feature Store concepts | A data store for pre-computed ML features shared across models | S5 + S9: transformation and performance (feature pipelines need both correctness and speed) | docs.hopsworks.ai (Hopsworks is open source) |

Which Track Is Right for You?

| If you want to... | Start with |
| --- | --- |
| Build dashboards and reports for business users | dbt → BigQuery → Looker |
| Build and operate data infrastructure | Airflow → cloud platform (AWS or GCP) |
| Work with real-time data products | Kafka → Spark Streaming |
| Work on ML teams as a data specialist | MLflow → cloud ML platform |

Part 2: Learning Path Context

This course is one step in a longer journey. Here is where it fits:

Stage 1: Python Foundation
  Python Course (Sections 1–12)
  └─ Functions, classes, file I/O, error handling, testing with pytest
          |
          v
Stage 2: Data Engineering Fundamentals  ← YOU ARE HERE
  This Course (Sections 1–10)
  └─ pandas, ETL, SQL, data quality, performance optimization
          |
          v
Stage 3: Cloud Platforms
  AWS Data Engineering or Google Cloud Professional DE
  └─ S3/GCS for storage, Redshift/BigQuery for warehousing, Lambda/Cloud Functions for compute
          |
          v
Stage 4: Orchestration
  Apache Airflow or Prefect
  └─ Schedule pipelines, manage dependencies, handle failures automatically
          |
          v
Stage 5: Distributed Processing
  Apache Spark (PySpark)
  └─ Process datasets too large for a single machine (100GB+ files)
          |
          v
Stage 6: Specialization
  Streaming (Kafka + Flink) OR Analytics Engineering (dbt + Looker)
  OR ML Pipelines (MLflow + Feature Store)

You do not need to complete all stages before being employable. A junior DE role typically requires Stages 1–3 (Python + pandas + one cloud platform), a mid-level role requires Stages 1–4 (adding Airflow), and senior roles require Stages 1–5 and beyond.

Part 3: Portfolio Project Ideas

A GitHub portfolio with 3 working pipelines is more valuable than any certification. Each project should have: a clear README explaining the problem, a working pipeline with tests, and a sample output file showing what it produces.

Project 1: Daily Weather ETL

What you build: Fetch daily weather data from the NOAA public API, clean the JSON response into a pandas DataFrame, validate it against a schema, append to a SQLite database, and generate a daily summary report.

Data source: NOAA's National Weather Service API (api.weather.gov) — free, no API key required for many endpoints

What to show in README: "Fetches 30 days of weather data for 5 cities, cleans 3 common data quality issues, stores in SQLite, and generates a trend summary. Run time: 8 seconds."

Skills demonstrated: S3 (JSON loading), S4 (cleaning), S6 (ETL structure), S7 (SQLite), S8 (validation)
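A skeleton of that pipeline, with the API call replaced by a hypothetical hard-coded JSON payload so it runs offline; clean, validate, and load are illustrative names:

```python
import json
import sqlite3

# Stand-in for the API response; a real version would fetch api.weather.gov.
RAW = '{"observations": [{"city": "Austin", "temp_c": "31.5"}, {"city": "Austin", "temp_c": null}]}'

def clean(raw_json: str) -> list[dict]:
    rows = json.loads(raw_json)["observations"]
    # Drop rows with missing temperature and cast string values to float.
    return [{"city": r["city"], "temp_c": float(r["temp_c"])}
            for r in rows if r["temp_c"] is not None]

def validate(rows: list[dict]) -> None:
    for r in rows:
        assert -90 <= r["temp_c"] <= 60, f"implausible temperature: {r}"

def load(rows: list[dict], conn: sqlite3.Connection) -> int:
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (:city, :temp_c)", rows)
    return conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0]

rows = clean(RAW)
validate(rows)
with sqlite3.connect(":memory:") as conn:
    print(load(rows, conn))  # 1 (the null-temperature row was dropped)
```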

Project 2: Stock Price Analyzer

What you build: Download historical stock prices from Yahoo Finance (yfinance library), compute moving averages, detect golden/death cross signals, validate data continuity (no missing trading days), and produce a formatted analysis report.

Data source: yfinance Python library — free, no API key

What to show in README: "Analyzes 1 year of price data for 10 stocks. Detects 3 signal types. Validates for data gaps. Output: text report + cleaned CSV."

Skills demonstrated: S5 (rolling window aggregation via .rolling().mean()), S8 (data continuity validation), S9 (vectorized signal detection)
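The signal detection can be sketched with .rolling().mean() on made-up prices (real data would come from yfinance); the window sizes here are illustrative, not the conventional 50/200-day pair:

```python
import pandas as pd

# Hypothetical closing prices; a real version would load these via yfinance.
prices = pd.Series([10, 11, 12, 11, 10, 9, 10, 12, 14, 15, 16, 15], dtype=float)

short = prices.rolling(window=3).mean()  # short-term moving average
long = prices.rolling(window=5).mean()   # long-term moving average

# A "golden cross" is the bar where the short MA first moves above the long MA.
above = short > long
golden_cross = above & ~above.shift(1, fill_value=False)
print(golden_cross[golden_cross].index.tolist())  # [4, 8]
```

A death cross is the mirror image: detect where `above` flips from True to False.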

Project 3: E-Commerce Log Pipeline

What you build: Download a public e-commerce event log dataset (Kaggle has several), parse event types, compute conversion funnel metrics (view → cart → purchase rates), detect high-traffic sessions, and generate a business summary.

Data source: Kaggle eCommerce Events History dataset (free download, ~100MB CSV)

What to show in README: "Processes 1M+ events. Computes conversion funnel for 5 event types. Outputs session-level stats and product performance ranking."

Skills demonstrated: S3 (large file loading with chunksize), S5 (groupby aggregation), S9 (chunked processing for memory efficiency)
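The funnel computation reduces to counting distinct sessions per stage. A stdlib sketch with a handful of made-up events (the Kaggle data has the same session-id plus event-type shape, just millions of rows read in chunks):

```python
# Hypothetical event log rows: (session_id, event_type).
events = [
    ("s1", "view"), ("s1", "cart"), ("s1", "purchase"),
    ("s2", "view"), ("s2", "cart"),
    ("s3", "view"),
]

def funnel(events):
    # Count distinct sessions that reached each funnel stage.
    stage_sessions = {"view": set(), "cart": set(), "purchase": set()}
    for session, etype in events:
        if etype in stage_sessions:
            stage_sessions[etype].add(session)
    counts = {k: len(v) for k, v in stage_sessions.items()}
    return {
        "view_to_cart": counts["cart"] / counts["view"],
        "cart_to_purchase": counts["purchase"] / counts["cart"],
    }

print(funnel(events))  # view->cart 2/3, cart->purchase 1/2
```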

Project 4: GitHub Activity Dashboard

What you build: Fetch repository statistics from the GitHub API (stars, forks, issues, commit frequency over time), store in SQLite, and produce a weekly trend report comparing 10 open-source projects.

Data source: GitHub REST API — free for public repos, 60 requests/hour unauthenticated, 5000 with token

What to show in README: "Tracks 10 repos over 52 weeks. Detects growth trends. Stores history locally. Generates weekly comparison report."

Skills demonstrated: S6 (ETL with error recovery for rate limits), S7 (SQLite for historical storage), S5 (time-based groupby)
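The rate-limit recovery can be sketched as retry with exponential backoff. RateLimitError and flaky_fetch are hypothetical stand-ins for the real HTTP layer:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 403/429 rate-limit response."""

def fetch_with_retry(fetch, retries=3, base_delay=0.01):
    # Retry on rate limits, doubling the wait each attempt.
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulated endpoint that rate-limits twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError
    return {"stars": 1234}

print(fetch_with_retry(flaky_fetch))  # {'stars': 1234} after two retries
```

A production version would also read GitHub's `X-RateLimit-Reset` response header instead of guessing the delay.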

Project 5: CSV Data Quality Auditor

What you build: A CLI tool that accepts any CSV file, runs the 4-dimension quality check from Project 6 of this course, and outputs a structured quality report. Add a --threshold flag for configurable PASS/WARN/FAIL levels.

Data source: User-supplied (the tool works on any CSV)

What to show in README: "Pass any CSV, get a quality report. Checks completeness, uniqueness, validity, and consistency. Configurable thresholds. Used on 3 public datasets in the examples/ directory."

Skills demonstrated: S8 (all quality dimension patterns), S6 (CLI pipeline structure), S3 (reading arbitrary CSVs with dtype inference)
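A minimal sketch of the auditor's core, showing one dimension (completeness) with a configurable threshold; the full tool would add the other three dimensions and expose the threshold via an argparse --threshold flag. Function names and the sample CSV are illustrative:

```python
import csv
import io

def completeness(rows: list[dict]) -> float:
    # Fraction of non-empty cells across the whole file.
    cells = [v for row in rows for v in row.values()]
    filled = sum(1 for v in cells if v not in ("", None))
    return filled / len(cells) if cells else 1.0

def audit(csv_text: str, threshold: float = 0.95) -> str:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    score = completeness(rows)
    return f"completeness={score:.2f} " + ("PASS" if score >= threshold else "FAIL")

sample = "id,name\n1,alice\n2,\n3,carol\n"
print(audit(sample))  # completeness=0.83 FAIL
```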

Part 4: 30/60/90-Day Learning Plan

Use this plan immediately after completing the course. The goal: consolidate what you learned, ship one portfolio project, and start the next learning stage.

Month 1 (Days 1–30): Consolidate

Week 1–2: Recall practice

  • Redo all 7 projects from memory without looking at solutions
  • For each project: write the pipeline from a blank file, then compare to your solution
  • Document each function you had to look up — those are your weak spots

Week 3–4: Extend one project

  • Pick the project that produced the output most relevant to a job you want
  • Add the Extension Challenges you skipped
  • Add a full pytest test suite (3+ tests per function)
  • Write a proper README

Goal by Day 30: 1 portfolio project fully tested and documented on GitHub.

Month 2 (Days 31–60): Build and Ship

Week 5–6: Choose and plan a new project

  • Pick one of the 5 portfolio project ideas above (or your own)
  • Write the README before writing any code (define the deliverable first)
  • Design the architecture (ASCII diagram showing input → steps → output)

Week 7–8: Build and iterate

  • Implement the pipeline with tests
  • Handle real data quality issues (they will be different from the course datasets)
  • Document every design decision in the README

Goal by Day 60: A second portfolio project on GitHub with real data, working pipeline, and pytest suite.

Month 3 (Days 61–90): Next Stage

Week 9–10: Cloud platform basics

  • Create a free tier account on AWS or GCP
  • Complete one official quickstart: AWS S3 + Lambda, or GCP Cloud Storage + Cloud Functions
  • Migrate one of your portfolio pipelines to run on the cloud (trigger on file upload)

Week 11–12: Orchestration introduction

  • Install Airflow locally (Docker Desktop + official quickstart)
  • Convert your daily weather or stock pipeline into an Airflow DAG
  • Schedule it to run daily and verify it runs automatically

Goal by Day 90: One pipeline running on a cloud platform, scheduled automatically. You are now in Stage 3–4 of the learning path.

You've Completed the Data Engineering Course

You have worked through 10 sections covering the complete fundamentals of data engineering with Python and pandas: from loading raw files and cleaning messy data, through building ETL pipelines and SQL-backed storage, to enforcing data contracts and optimizing for performance. The 7 capstone projects gave you end-to-end implementations that demonstrate every major pattern.

The skills you now have — pandas, ETL design, data quality, SQL, and performance optimization — are the minimum viable toolkit for a data engineering role. They are also the foundation that every advanced tool (Airflow, Spark, dbt, Kafka) is built on top of.

Ship something. The gap between "I learned data engineering" and "I work in data engineering" is a GitHub repository with 3 working pipelines.

Key Takeaways

  • Skills compound: pandas + SQL + testing is the minimum viable DE toolkit; everything else (Airflow, Spark, dbt) is built on top of it
  • Public datasets are your practice material: NOAA, data.gov, Kaggle, SEC EDGAR, GitHub API — all free, all real, all have messy data quality issues
  • A GitHub portfolio with 3 working pipelines is more valuable than certifications: hiring managers run your code; they don't read certificate PDFs
  • Learn the next tool when you have a project that needs it: don't learn Airflow in the abstract — migrate a pipeline you care about
  • Teach what you build: write about your projects (GitHub README, blog post, LinkedIn post) — explaining forces clarity and creates networking opportunities

Common Mistakes to Avoid

  • Learning without building: reading pandas documentation and watching tutorials creates familiarity but not skill. Skill comes from debugging real data problems. Every hour spent on a real dataset is worth 3 hours of tutorial time.

  • Picking too complex a first portfolio project: the goal is to ship something. A CSV → cleaned CSV → SQLite pipeline with a quality report is a legitimate portfolio project. It demonstrates profiling, cleaning, validation, and SQL — exactly what a DE hiring manager wants to see. You do not need Spark or Kafka for your first project.

  • Skipping testing: untested pipelines fail silently in production. A pipeline that produces wrong output without raising an error is harder to debug than one that raises immediately. Three tests per function is the minimum — test the happy path, one edge case (empty DataFrame, single row), and one error case (missing column, null in required field).

  • Optimizing before shipping: don't reach for chunked processing or vectorization until the pipeline produces correct output. Correct and slow is debuggable. Fast and wrong is a production incident.
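The three-tests-per-function minimum can be sketched against a hypothetical drop_null_ids helper (plain asserts here; a real suite would use pytest):

```python
def drop_null_ids(rows: list[dict]) -> list[dict]:
    # Hypothetical pipeline step: remove rows whose required id is null.
    if rows and "id" not in rows[0]:
        raise KeyError("missing column: id")
    return [r for r in rows if r.get("id") is not None]

# 1. Happy path: a null id is dropped, valid rows survive.
assert drop_null_ids([{"id": 1}, {"id": None}]) == [{"id": 1}]
# 2. Edge case: empty input produces empty output, no crash.
assert drop_null_ids([]) == []
# 3. Error case: a missing required column raises immediately.
try:
    drop_null_ids([{"name": "x"}])
    raise AssertionError("expected KeyError")
except KeyError:
    pass
```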

Final Words

Ten sections. Seven projects. One complete curriculum.

You now have the foundation to work with real data, build reliable pipelines, and continue learning on your own. The next step is yours to take — pick a project, write the README, and ship it.

