
Lesson 3: The Big Three Providers — Live Model Comparison

Course: AI-Powered Development (Dev Track) | Duration: 2 hours | Level: Beginner

Learning Objectives

By the end of this lesson, students will be able to:

  • Distinguish the key strengths, weaknesses, and pricing of Claude, GPT/o-series, and Gemini
  • Explain what "frontier model" means and why it matters for development work
  • Select the right model for a given coding task based on objective criteria
  • Run an open-source model locally using Ollama
  • Write a clear model selection rationale for their own project

Prerequisites

  • Lesson 1: What is an LLM? (understanding tokenization, context windows, temperature)
  • Lesson 2: Model Families and Versioning (understanding Opus/Sonnet/Haiku-style tiers)
  • An API key for at least one provider (Anthropic, OpenAI, or Google) is helpful but not required

Lesson Outline

AI Provider Landscape — Anthropic, OpenAI, Google, Open Source

Part 1: Cold Open — Same Prompt, 4 Models (15 min)

The Setup

The fastest way to understand model differences is not to read benchmark tables — it is to watch the models solve the same problem side by side. Before any theory, let's run an experiment.

The Prompt (sent verbatim to four models):

code
You are a senior Python engineer. Refactor the following function so that it:
1. Has proper type hints
2. Handles edge cases (empty input, None, non-integer items in the list)
3. Is well-documented with a Google-style docstring
4. Does not change the observable behavior for valid inputs
5. Returns a list, not a generator

Original code:
def process(data):
    return (x * 2 for x in data if x > 0)

Respond with ONLY the refactored code, no explanation.

Side-by-Side Outputs

Claude Opus 4 (Anthropic):

python
from typing import Optional
 
 
def process(data: Optional[list]) -> list:
    """Double all positive integers in a sequence.
 
    Args:
        data: A list of values to process. May be None or contain
              non-integer items, which are silently skipped.
 
    Returns:
        A new list containing each positive integer from `data`
        multiplied by 2, in input order. Returns an empty list if
        `data` is None, empty, or contains no positive integers.
 
    Examples:
        >>> process([1, -2, 3, "x", None])
        [2, 6]
        >>> process([])
        []
        >>> process(None)
        []
    """
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

GPT-5 (OpenAI):

python
from typing import Any
 
 
def process(data: list[Any] | None) -> list[int]:
    """Double all positive integers in the input list.
 
    Args:
        data: Input list. Non-integer items and non-positive integers
              are excluded. None is treated as an empty list.
 
    Returns:
        List of doubled positive integers.
 
    Examples:
        >>> process([1, 2, -1, "skip", None])
        [2, 4]
    """
    if data is None:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

Gemini 2.5 Pro (Google):

python
from typing import Any
 
 
def process(data: list[Any] | None = None) -> list[int]:
    """Doubles all positive integers found in the input data list.
 
    Iterates through the provided list, filters for items that are
    positive integers, doubles them, and returns the results as a list.
    Non-integer types and non-positive integers are silently ignored.
 
    Args:
        data: A list of items to process. Items that are not integers
              or are not positive will be ignored. Defaults to None.
 
    Returns:
        A list of integers, where each element is double the value
        of a positive integer found in `data`. Returns an empty list
        if `data` is None, empty, or contains no qualifying items.
 
    Examples:
        >>> process([1, 2, 3])
        [2, 4, 6]
        >>> process([1, -1, 0, 'hello', None, 2])
        [2, 4]
        >>> process([])
        []
        >>> process(None)
        []
    """
    if not data:
        return []
    return [item * 2 for item in data if isinstance(item, int) and item > 0]

DeepSeek R1 (DeepSeek):

python
def process(data: list | None) -> list[int]:
    """Process a list by doubling positive integers.
 
    Args:
        data: Input list, may contain any types or be None.
 
    Returns:
        List of positive integers from data each multiplied by 2.
        Empty list if data is None, empty, or has no positive ints.
    """
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

What Do We Observe?

| Dimension | Claude Opus 4 | GPT-5 | Gemini 2.5 Pro | DeepSeek R1 |
|---|---|---|---|---|
| Instruction followed exactly? | Yes | Yes | Almost (added default arg) | Yes |
| Docstring quality | Excellent (with doctests) | Good | Verbose but thorough | Concise |
| Type hint style | `Optional[list]` (3.9 compat) | `list[Any] \| None` (3.10+) | `list[Any] \| None` (3.10+) | `list \| None` |
| Edge case: None | Handled | Handled | Handled + default arg | Handled |
| Edge case: non-integer | Handled | Handled | Handled | Handled |
| Token count (approx) | 280 | 190 | 320 | 150 |
| Unexpected additions | Doctests in docstring | None | Default parameter | None |

Key observations:

  • All four models produced functionally correct, production-quality code
  • Claude added executable doctests unprompted — useful but technically outside the spec
  • Gemini added = None as a default parameter — a reasonable interpretation but technically changes the function signature
  • DeepSeek R1 was the most terse and literal — least opinionated
  • GPT-5 used the most modern Python union syntax (|) by default
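
The first observation can be checked mechanically rather than by eye. A minimal harness sketch — two representative implementations are reproduced below (their bodies are identical precisely because the models converged on the same logic):

```python
# Representative variants from the side-by-side outputs above.
def process_claude(data):
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]


def process_deepseek(data):
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]


# Edge cases drawn from the prompt's requirements
CASES = [None, [], [1, -2, 3, "x", None], [0, 1, 2], ["a", "b"]]


def equivalent(*implementations):
    """Return True if every implementation agrees on every test case."""
    for case in CASES:
        results = {tuple(impl(case)) for impl in implementations}
        if len(results) != 1:
            return False
    return True


print(equivalent(process_claude, process_deepseek))  # True
```

The same harness extends naturally to all four outputs — paste each model's function in under a distinct name and add it to the call.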

Teacher's Note: Ask the class: "Which output would you submit to a code review? Which surprised you?" There is no wrong answer here — the goal is to build intuition about model "personalities."

Part 2: Anthropic — Claude Models (25 min)

The Claude Family (2025-2026)

Anthropic releases models in a tiered structure. As of early 2026, the primary models available are:

| Model | Context Window | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|
| Claude Opus 4 | 200K tokens | Complex reasoning, large codebase work | $15 / 1M tokens | $75 / 1M tokens |
| Claude Sonnet 4 | 200K tokens | Daily coding, balanced speed/quality | $3 / 1M tokens | $15 / 1M tokens |
| Claude Haiku 3.5 | 200K tokens | Fast, cheap, simple tasks | $0.80 / 1M tokens | $4 / 1M tokens |

Note: Prices listed are approximate as of Q1 2026. Always verify current pricing at console.anthropic.com/pricing.

What Makes Claude Different

1. Instruction following

Claude is widely regarded as the strongest model for precise, multi-step instruction following. In benchmarks and anecdotal developer experience, Claude is less likely to "drift" — to add unrequested content, change a constraint mid-response, or silently ignore a negative instruction ("do NOT include X").

This matters enormously in coding workflows where you write system prompts like: "Never add comments. Always return only the changed lines. Never rename variables."

2. Long-context faithfulness (200K tokens)

200,000 tokens is approximately 150,000 words — roughly 500 pages of code. Claude can:

  • Accept an entire medium-sized codebase in a single prompt
  • Answer questions about cross-file dependencies
  • Produce a multi-file diff with consistent variable naming across files
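
Loading a codebase into a single prompt is mostly careful string assembly. A minimal sketch, using an in-memory dict in place of real files (the filenames, contents, and question here are illustrative):

```python
def pack_files(files: dict[str, str], question: str) -> str:
    """Concatenate files into one prompt, with filename headers so the
    model can reference cross-file dependencies by path."""
    sections = [f"[FILE: {path}]\n{content}" for path, content in files.items()]
    return "\n\n".join(sections) + f"\n\nQuestion: {question}"


# Illustrative stand-ins for real source files
repo = {
    "utils.py": "def slugify(s):\n    return s.lower().replace(' ', '-')",
    "views.py": "from utils import slugify\n\ndef page(title):\n    return slugify(title)",
}

prompt = pack_files(repo, "Which files would break if slugify were renamed?")
print(prompt.splitlines()[0])  # [FILE: utils.py]
```

In practice you would read the files from disk and filter out generated artifacts (virtualenvs, build output) before packing, since every token counts against the window.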

Claude's context faithfulness (how well it uses content from early in a long prompt) is measured in the "Needle in a Haystack" benchmark. Claude consistently scores near the top.

3. Multi-file architectural planning

This is where Claude particularly shines. Given a complex refactoring task spanning 10+ files, Claude tends to:

  • Recognize implicit constraints (e.g., a function in utils.py is called from 4 other files — renaming it requires updating all 4)
  • Produce a reasoning trace before the code (if asked)
  • Maintain consistent design decisions across a long response

4. Constitutional AI background

Anthropic was founded with a focus on AI safety research. Claude's training includes techniques designed to make it more honest (less likely to hallucinate confidently), more helpful, and less harmful. For production coding work, this translates to Claude being more willing to say "I'm not sure" or "this might not be the best approach" rather than confidently generating buggy code.

Code Example: Multi-File Refactoring Task

Scenario: You have a Django app where the User model has a full_name property computed from first_name and last_name. A new requirement says you need to support a display_name field that can be set independently. You need Claude to identify all the places this will break and produce a migration plan.

Prompt to Claude Opus 4:

code
I have a Django app. Here are 4 files. I need to add a `display_name` field
to the User model that:
- Is optional (nullable)
- Falls back to `first_name + " " + last_name` if not set
- Must be updated in the admin, serializers, and all templates that use `full_name`

Analyze the impact and give me the complete list of changes needed, then
produce each changed file in full.

[FILE: models.py]
[FILE: admin.py]
[FILE: serializers.py]
[FILE: templates/profile.html]

What Claude produces (summary of output structure):

code
Impact Analysis:
1. models.py — Add `display_name = models.CharField(max_length=200, blank=True, null=True)`
   Update `full_name` property to use display_name if set.
   Add database migration.

2. admin.py — Add `display_name` to `fieldsets` and `list_display`.

3. serializers.py — Add `display_name` to fields list.
   Update `full_name` SerializerMethodField to call model property.

4. templates/profile.html — No change needed (uses `{{ user.full_name }}`
   which now goes through the updated property).

Migration risk: LOW — existing `full_name` callers all go through the property,
so no data migration required for existing records.

--- models.py (complete) ---
[full file content...]

--- admin.py (complete) ---
[full file content...]

--- serializers.py (complete) ---
[full file content...]

--- 0003_user_display_name.py (migration) ---
[full migration file content...]

Why this showcases Claude's strength: It identified that templates/profile.html did NOT need changing (because it called full_name which is a property — not the database field directly). A weaker model would have either missed the template entirely or incorrectly modified it.

When to Use Claude

  • Large codebase navigation (loading 50+ files into context)
  • Complex, multi-step refactors requiring architectural awareness
  • Tasks with strict output format requirements ("return ONLY valid JSON")
  • Security-sensitive code review (Claude tends to flag issues proactively)
  • Writing technical documentation from code

When NOT to Use Claude

  • Rapid fire, cheap API calls at scale (cost can add up — use Haiku or Sonnet)
  • Real-time streaming chat UIs (latency for Opus is higher)
  • Tasks requiring internet access (Claude cannot browse the web natively)

Part 3: OpenAI — GPT and o-series Models (25 min)

The OpenAI Model Lineup (2025-2026)

OpenAI maintains two distinct model families with different reasoning approaches:

| Model | Context Window | Reasoning Type | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|---|
| GPT-5 | 128K tokens | Standard | General coding, chat, prototyping | $2.50 / 1M tokens | $10 / 1M tokens |
| o3 | 200K tokens | Extended thinking | Math, algorithms, hard logic | $10 / 1M tokens | $40 / 1M tokens |
| o4-mini | 128K tokens | Extended thinking (fast) | Coding + reasoning at low cost | $1.10 / 1M tokens | $4.40 / 1M tokens |
| GPT-4o | 128K tokens | Standard (multimodal) | Vision tasks, existing integrations | $2.50 / 1M tokens | $10 / 1M tokens |

Note: Prices are approximate as of Q1 2026. Verify at platform.openai.com/docs/pricing.

The GPT Family — Standard Models

GPT-5 is the flagship general-purpose model. Its key characteristics for development work:

  • Strong code generation: GPT-5 has been trained on an enormous volume of open-source code and is very effective at generating boilerplate, implementing standard patterns, and producing working code quickly
  • Strong ecosystem knowledge: Knows popular libraries, frameworks, and APIs extremely well (React, Django, FastAPI, pandas, etc.)
  • Good "developer experience" intuition: Often produces code that feels idiomatic to experienced developers — it has absorbed community conventions
  • Slightly more permissive: Will attempt tasks even when the requirements are ambiguous. This is sometimes useful, sometimes leads to incorrect assumptions

The o-series — Extended Thinking Models

The o3 and o4-mini models represent a fundamentally different approach. Before producing output, they run an internal "thinking" phase (sometimes called chain-of-thought or scratchpad reasoning) that is NOT shown to the user but influences the final answer.

What this means in practice:

  • Response latency is higher (seconds longer than GPT-5)
  • Quality on multi-step logical problems is significantly better
  • Mathematical reasoning, algorithm design, and complex debugging all improve
  • Overkill for simple tasks

o4-mini is the sweet spot for most coding workflows — it has nearly o3-level reasoning at roughly 10% of the cost.

code
GPT-5:   Fast + broad knowledge = good for most tasks
o4-mini: Slower + deeper reasoning = good for hard problems at low cost
o3:      Slowest + deepest reasoning = good for the hardest problems
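
That tiering can be encoded as a simple routing helper. A hypothetical sketch — the complexity heuristic and thresholds are illustrative, not an official OpenAI recommendation:

```python
def pick_openai_model(task_complexity: int, budget_sensitive: bool = True) -> str:
    """Pick a model tier from a rough 1-10 complexity estimate.

    Heuristic only: routine tasks go to the standard model, mid-range
    reasoning to o4-mini, and the hardest problems to o3 when budget allows.
    """
    if task_complexity <= 4:
        return "gpt-5"
    if task_complexity <= 8 or budget_sensitive:
        return "o4-mini"
    return "o3"


print(pick_openai_model(3))                            # gpt-5
print(pick_openai_model(7))                            # o4-mini
print(pick_openai_model(10, budget_sensitive=False))   # o3
```

Teams often wire a helper like this into their tooling so that cheap calls are the default and expensive reasoning models must be opted into.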

Code Example: Rapid Prototyping Task

Scenario: You need a FastAPI endpoint that accepts a CSV upload, validates that it has the required columns, and returns summary statistics. You need it working in 5 minutes.

Prompt to GPT-5:

code
Create a FastAPI endpoint POST /upload-csv that:
- Accepts a CSV file upload
- Validates it contains columns: user_id, amount, date
- Returns JSON with: row_count, total_amount, date_range (min/max date)
- Returns 400 with a clear error message if columns are missing

Use pandas for the CSV processing. Keep it concise.

GPT-5 Output:

python
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
import pandas as pd
import io
 
app = FastAPI()
 
REQUIRED_COLUMNS = {"user_id", "amount", "date"}
 
 
@app.post("/upload-csv")
async def upload_csv(file: UploadFile = File(...)):
    contents = await file.read()
    try:
        df = pd.read_csv(io.BytesIO(contents))
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid CSV: {e}")
 
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise HTTPException(
            status_code=400,
            detail=f"Missing required columns: {sorted(missing)}"
        )
 
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
 
    return JSONResponse({
        "row_count": len(df),
        "total_amount": round(df["amount"].sum(), 2),
        "date_range": {
            "min": str(df["date"].min().date()),
            "max": str(df["date"].max().date())
        }
    })

Why GPT-5 excels here: The output is idiomatic, complete, and runnable on the first try with no modifications. It correctly handled the async pattern for FastAPI file uploads, added type coercion for robustness, and formatted the response as a developer would expect. Time to working prototype: under 2 minutes.
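
The same validation-and-summary logic can be expressed with only the standard library — a useful variant (not part of the GPT-5 output above) for environments where pandas is unavailable. Note this sketch assumes ISO-formatted dates, which sort correctly as strings:

```python
import csv
import io

REQUIRED_COLUMNS = {"user_id", "amount", "date"}


def summarize_csv(text: str) -> dict:
    """Validate required columns and compute summary stats from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = list(reader)
    amounts = [float(r["amount"]) for r in rows]
    dates = [r["date"] for r in rows]  # ISO dates sort lexicographically
    return {
        "row_count": len(rows),
        "total_amount": round(sum(amounts), 2),
        "date_range": {"min": min(dates), "max": max(dates)},
    }


sample = "user_id,amount,date\nu1,10.5,2024-01-02\nu2,4.5,2024-01-01\n"
print(summarize_csv(sample))
# {'row_count': 2, 'total_amount': 15.0, 'date_range': {'min': '2024-01-01', 'max': '2024-01-02'}}
```

Comparing this against the pandas version is itself a good exercise: pandas buys you robust type coercion (`errors="coerce"`) that the stdlib version would need to implement by hand.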

Code Example: o4-mini for Hard Algorithmic Reasoning

Prompt to o4-mini:

code
I have a list of tasks, each with a start time, end time, and profit value.
I can only run one task at a time. Find the maximum profit I can earn.
This is the weighted job scheduling problem.
Write a Python solution that is O(n log n). Explain the approach first,
then give the code.

o4-mini Output (reasoning phase hidden, final answer):

code
Approach:
Sort jobs by end time. Use binary search to find, for each job,
the last job that doesn't overlap with it. Build a DP table where
dp[i] = max profit using jobs 0..i. For each job i:
  dp[i] = max(dp[i-1], jobs[i].profit + dp[last_non_overlapping(i)])

Time: O(n log n) for sort + O(n log n) for n binary searches = O(n log n)
Space: O(n) for DP array
python
import bisect
from dataclasses import dataclass
 
 
@dataclass
class Job:
    start: int
    end: int
    profit: int
 
 
def max_profit(jobs: list[Job]) -> int:
    """Find maximum profit from non-overlapping jobs (weighted job scheduling).
 
    Args:
        jobs: List of Job objects with start, end, and profit fields.
 
    Returns:
        Maximum achievable profit from a non-overlapping subset of jobs.
 
    Time complexity: O(n log n)
    Space complexity: O(n)
    """
    if not jobs:
        return 0
 
    jobs.sort(key=lambda j: j.end)
    end_times = [j.end for j in jobs]
    n = len(jobs)
    dp = [0] * (n + 1)  # dp[i] = max profit using first i jobs
 
    for i in range(1, n + 1):
        job = jobs[i - 1]
        # Find last job that ends <= current job's start time
        k = bisect.bisect_right(end_times, job.start, 0, i - 1)
        dp[i] = max(dp[i - 1], job.profit + dp[k])
 
    return dp[n]
 
 
# Example
jobs = [
    Job(start=1, end=3, profit=50),
    Job(start=3, end=5, profit=20),
    Job(start=6, end=19, profit=100),
    Job(start=2, end=100, profit=200),
]
print(max_profit(jobs))  # 200 (job 4 alone beats jobs 1+2+3: 200 > 50 + 20 + 100 = 170)

Why o4-mini excels here: The solution is provably correct, uses the right algorithm (not a greedy approximation), and the extended thinking phase eliminated common mistakes like off-by-one errors in the binary search index.
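
A good habit with model-generated algorithms is to cross-check them against an obviously-correct brute force on small random inputs. A sketch reusing the Job and max_profit definitions from the answer above (the brute force is exponential, so it is only viable at small n):

```python
import bisect
import itertools
import random
from dataclasses import dataclass


@dataclass
class Job:
    start: int
    end: int
    profit: int


def max_profit(jobs: list[Job]) -> int:
    """O(n log n) DP solution, as in the o4-mini answer above."""
    if not jobs:
        return 0
    jobs = sorted(jobs, key=lambda j: j.end)
    end_times = [j.end for j in jobs]
    n = len(jobs)
    dp = [0] * (n + 1)
    for i in range(1, n + 1):
        job = jobs[i - 1]
        k = bisect.bisect_right(end_times, job.start, 0, i - 1)
        dp[i] = max(dp[i - 1], job.profit + dp[k])
    return dp[n]


def brute_force(jobs: list[Job]) -> int:
    """Exponential check: best profit over all pairwise-compatible subsets."""
    best = 0
    for r in range(len(jobs) + 1):
        for subset in itertools.combinations(jobs, r):
            if all(a.end <= b.start or b.end <= a.start
                   for a, b in itertools.combinations(subset, 2)):
                best = max(best, sum(j.profit for j in subset))
    return best


random.seed(0)
for _ in range(100):
    jobs = [Job(s := random.randint(0, 20), s + random.randint(1, 10),
                random.randint(1, 100)) for _ in range(6)]
    assert max_profit(jobs) == brute_force(jobs)
print("DP matches brute force on 100 random instances")
```

If the assertion ever fails, you have a minimal counterexample to paste back into the model — a far stronger debugging prompt than "this looks wrong."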

When to Use OpenAI Models

  • GPT-5: Rapid prototyping, boilerplate generation, tasks needing broad framework knowledge, existing integrations using the OpenAI SDK
  • o4-mini: Algorithm design, debugging complex logic, math-heavy code, cost-effective reasoning tasks
  • o3: The hardest problems you face — when correctness matters more than speed and cost

When NOT to Use OpenAI Models

  • Massive context requirements (Claude and Gemini have larger windows)
  • Strict, unusual formatting requirements (o-series in particular can be verbose)
  • Privacy-sensitive on-premises work (use open-source alternatives)

Part 4: Google — Gemini Models (20 min)

The Gemini Lineup (2025-2026)

| Model | Context Window | Key Feature | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | 1M tokens | Largest context, multimodal | Massive codebase analysis, video/image + code | $1.25 / 1M tokens (up to 200K) | $10 / 1M tokens |
| Gemini 2.5 Flash | 1M tokens | Fast, cheap, large context | High-volume tasks, long-doc processing | $0.075 / 1M tokens | $0.30 / 1M tokens |

Note: Gemini 2.5 Pro pricing has a tiered structure — prompts over 200K tokens are charged at a higher rate. Verify current pricing at ai.google.dev/pricing.

What Makes Gemini Different

1. The 1 Million Token Context Window

One million tokens is approximately 750,000 words — roughly 2,500 pages of code or a complete large monorepo. In practical terms this means:

  • Load an entire 200-file Python project into a single prompt
  • Include all your test files, documentation, and configuration
  • Ask cross-cutting questions: "Which functions are called from more than 3 files?" or "Find all places we're not handling the database connection timeout"

2. Native Multimodality

Gemini was designed from the ground up to handle text, images, video, and audio in the same prompt. For developers this means:

  • Send a screenshot of a UI bug and ask "write the CSS to fix this"
  • Send an architecture diagram image and ask "generate the database schema for this ERD"
  • Send a video walkthrough of a bug and ask "what is causing this behavior?"

3. Speed at Scale

Gemini 2.5 Flash is exceptionally fast and cheap, making it suitable for high-volume automated pipelines — code review bots, CI comment generators, documentation scanners.

Code Example: Multi-Language Repository Analysis

Scenario: You have a repository with a Python FastAPI backend, a TypeScript React frontend, and a Go microservice for image processing. You want to understand all the API contracts and find any mismatches.

Prompt to Gemini 2.5 Pro:

code
I'm providing a multi-language repository: Python FastAPI backend, TypeScript
React frontend, and Go image processing service.

1. Extract all API endpoints defined in the backend (path, method, request schema, response schema)
2. Extract all API calls made from the frontend (URL pattern, expected response shape)
3. Extract all HTTP endpoints exposed by the Go service
4. Identify any mismatches between what the frontend expects and what the backend provides
5. Output results as structured JSON

[Full contents of api/routes.py — 800 lines]
[Full contents of api/schemas.py — 400 lines]
[Full contents of src/api/client.ts — 600 lines]
[Full contents of src/types/api.ts — 300 lines]
[Full contents of imageservice/main.go — 700 lines]
[Full contents of imageservice/handlers.go — 500 lines]

Gemini 2.5 Pro Output structure:

json
{
  "backend_endpoints": [
    {
      "path": "/api/v1/users/{user_id}",
      "method": "GET",
      "response_schema": {
        "id": "int",
        "email": "str",
        "display_name": "str | null",
        "created_at": "datetime"
      }
    },
    ...
  ],
  "frontend_api_calls": [
    {
      "file": "src/api/client.ts",
      "line": 47,
      "url_pattern": "/api/v1/users/{userId}",
      "expected_fields": ["id", "email", "name", "createdAt"]
    },
    ...
  ],
  "mismatches": [
    {
      "severity": "HIGH",
      "description": "Frontend expects field 'name' but backend returns 'display_name'",
      "frontend_location": "src/api/client.ts:47",
      "backend_location": "api/schemas.py:23"
    },
    {
      "severity": "MEDIUM",
      "description": "Frontend expects camelCase 'createdAt', backend returns snake_case 'created_at'. Verify your serializer is applying camelCase conversion.",
      "frontend_location": "src/types/api.ts:12",
      "backend_location": "api/schemas.py:26"
    }
  ]
}

Why Gemini excels here: The 1M token window means the entire relevant codebase fits in a single prompt with room to spare. Finding cross-language API contract mismatches typically requires custom tooling or manual review — Gemini can do it in one shot, and the same approach keeps working as the repository grows to hundreds of files that would force other models to chunk the input.
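
The camelCase/snake_case mismatch in the report above is common enough to check mechanically as a baseline. A hypothetical sketch — the field lists are illustrative stand-ins for schemas extracted from real code:

```python
def to_camel(snake: str) -> str:
    """created_at -> createdAt"""
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)


def field_mismatches(backend_fields: set[str], frontend_fields: set[str]) -> dict:
    """Classify each frontend field as matched, case-mismatch, or missing."""
    camel_map = {to_camel(f): f for f in backend_fields}
    report = {"matched": [], "case_mismatch": [], "missing": []}
    for f in frontend_fields:
        if f in backend_fields:
            report["matched"].append(f)
        elif f in camel_map:
            report["case_mismatch"].append((f, camel_map[f]))
        else:
            report["missing"].append(f)
    return report


backend = {"id", "email", "display_name", "created_at"}
frontend = {"id", "email", "name", "createdAt"}
print(field_mismatches(backend, frontend))
```

A deterministic checker like this catches the naming-convention class of mismatch; the model's value-add is the harder cases, like "name" vs "display_name", where the mapping is semantic rather than mechanical.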

Gemini for Document-Heavy Developer Work

Beyond code, Gemini's large context shines for:

  • Dependency audits: Load all requirements.txt / package.json / go.mod files from a monorepo and find conflicting versions
  • Security scanning: Load all configuration files and find hardcoded credentials or unsafe defaults
  • Test coverage analysis: Load all test files and application code, ask "which functions have no corresponding test?"
  • Changelog generation: Load all commit messages from the last 6 months, produce a structured CHANGELOG
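
The dependency-audit idea can be prototyped without a model at all, which makes a useful baseline for grading the model's answer. A minimal sketch over in-memory requirements.txt contents (package names, versions, and paths are illustrative; only exact `==` pins are considered):

```python
from collections import defaultdict


def find_conflicts(requirement_files: dict[str, str]) -> dict[str, dict[str, str]]:
    """Map package -> {file: pinned_version} for packages pinned differently."""
    pins: dict[str, dict[str, str]] = defaultdict(dict)
    for path, text in requirement_files.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments, blanks, and non-exact pins
            name, version = line.split("==", 1)
            pins[name.strip()][path] = version.strip()
    return {pkg: files for pkg, files in pins.items()
            if len(set(files.values())) > 1}


monorepo = {
    "api/requirements.txt": "requests==2.31.0\npydantic==2.5.0",
    "worker/requirements.txt": "requests==2.28.0\ncelery==5.3.0",
}
print(find_conflicts(monorepo))
# {'requests': {'api/requirements.txt': '2.31.0', 'worker/requirements.txt': '2.28.0'}}
```

The model version of this task earns its keep on the messier inputs — range specifiers, package.json semver, go.mod — where a twenty-line parser stops being enough.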

When to Use Gemini Models

  • Gemini 2.5 Pro: Any task requiring more than 100K tokens of context, multimodal tasks (image + code), multi-language repository analysis
  • Gemini 2.5 Flash: High-volume automated pipelines, cheap long-document processing, CI/CD integrations where cost-per-call matters

When NOT to Use Gemini Models

  • Strict instruction following on unusual formats (Claude tends to be more reliable here)
  • Complex multi-step reasoning without retrieval (o3/o4-mini tends to be stronger)
  • Privacy-sensitive work where you cannot send data to Google's servers

Part 5: Open Source — DeepSeek, Llama, Qwen (15 min)

The Open-Source Landscape (2025-2026)

"Open source" in the LLM context means the model weights are publicly released, allowing you to download and run the model on your own hardware without sending data to any external server.

| Model | Parameters | Context | Strengths | License |
|---|---|---|---|---|
| DeepSeek R1 | 671B (MoE) | 128K | Strong reasoning, competitive with o1-class | MIT |
| DeepSeek V3 | 671B (MoE) | 128K | Strong coding, cheap API | MIT |
| Llama 4 Scout | 17B active | 10M | Massive context, efficient | Llama 4 Community |
| Llama 4 Maverick | 17B active | 1M | Multimodal, strong coding | Llama 4 Community |
| Qwen3 235B | 22B active | 128K | Multilingual, very strong coding | Apache 2.0 |
| Qwen3 32B | 32B | 128K | Local-friendly, strong reasoning | Apache 2.0 |

Note: "MoE" means Mixture of Experts — only a subset of parameters are active per token, making the model more efficient to run than the parameter count implies.

Why Open Source Matters for Developers

Scenario 1: Air-gapped environments

You work at a bank, hospital, or government agency where no data can leave the internal network. You cannot call Anthropic, OpenAI, or Google APIs. Open-source models are your only option for AI-assisted development.

Scenario 2: Cost at scale

Running 100 million input tokens per month through Claude Opus 4 costs roughly $1,500 at $15 per 1M tokens — before counting output tokens, which are five times more expensive. Running Qwen3 32B on a $10,000 GPU server pays for itself well within a year at that volume — and then you have permanent capacity.

Scenario 3: Fine-tuning

You want a model that specializes in your internal coding standards, your company's DSL, or your specific tech stack. Fine-tuning requires access to the weights — which only open-source models provide.

Scenario 4: Privacy and compliance

GDPR, HIPAA, and other regulations may prohibit sending certain data to third-party API providers. Self-hosted models solve this.

How to Run Models Locally with Ollama

Ollama is the simplest way to run open-source LLMs on your laptop. It handles model downloads, quantization, and serving a local API that is compatible with the OpenAI SDK.

Installation (macOS/Linux):

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull a model (Qwen3 14B — good balance of quality and size)
ollama pull qwen3:14b
 
# Pull DeepSeek R1 distill (smaller, faster version)
ollama pull deepseek-r1:14b
 
# Pull Llama 4 Scout
ollama pull llama4:scout

Run a model interactively:

bash
ollama run qwen3:14b

Use Ollama's API (compatible with OpenAI SDK):

python
# Ollama serves an OpenAI-compatible API on localhost:11434
# You can use the openai library with no changes except the base_url
 
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK but not used
)
 
response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "user", "content": "Refactor this function with type hints: def add(a, b): return a + b"}
    ]
)
 
print(response.choices[0].message.content)

Hardware requirements (approximate):

| Model | RAM Required | Recommended GPU VRAM |
|---|---|---|
| Qwen3 8B (Q4 quantized) | 8 GB | 6 GB |
| Qwen3 14B (Q4 quantized) | 10 GB | 8 GB |
| Qwen3 32B (Q4 quantized) | 20 GB | 16 GB |
| DeepSeek R1 14B (Q4) | 10 GB | 8 GB |
| Llama 4 Scout (Q4) | 12 GB | 10 GB |

Q4 quantized = the model weights are compressed from 16-bit floats to roughly 4 bits per weight. This reduces memory by ~4x with modest quality loss — acceptable for most coding tasks.
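
The RAM figures above follow from simple arithmetic: bytes ≈ parameter count × bits per weight / 8, plus runtime overhead. A back-of-envelope sketch (it ignores KV cache and activation memory, which is why the table's figures run slightly higher):

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billions * bits_per_weight / 8


for model, params in [("Qwen3 8B", 8), ("Qwen3 14B", 14), ("Qwen3 32B", 32)]:
    print(f"{model}: ~{weight_gb(params, 4):.0f} GB at Q4 "
          f"(vs ~{weight_gb(params, 16):.0f} GB at FP16)")
```

The same formula explains why MoE models are listed by active parameters: only the active experts need to be resident per token, though the full weights must still fit somewhere.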

DeepSeek R1 — The Reasoning Model

DeepSeek R1 deserves special attention because it is the first open-source model to match closed-source o1-class reasoning performance. Released in January 2025 by the Chinese AI lab DeepSeek, it shocked the industry with performance competitive with o1 at a fraction of the training cost.

  • Strengths: Mathematics, algorithmic reasoning, complex debugging
  • Weaknesses: English instruction following is slightly less polished than Claude/GPT-5; responses can occasionally mix in Chinese text
  • Key insight: The DeepSeek R1 "distill" variants (1.5B to 70B) are smaller open models fine-tuned on R1's reasoning traces; they run on consumer hardware while retaining much of the reasoning improvement

When to Use Open Source Models

  • Air-gapped or privacy-sensitive environments
  • High-volume tasks where API costs are prohibitive
  • Fine-tuning for domain-specific needs
  • Experimentation and research (freedom to inspect internals)
  • When you need permanent, offline capability

When NOT to Use Open Source Models

  • You need frontier-level quality and don't have the hardware
  • Setup time and maintenance overhead are not acceptable
  • Your team has no DevOps capacity to host and maintain a model server

Part 6: Live Experiments (20 min)

These four experiments are designed to be run live during class (or as take-home exercises if time is short). Each experiment isolates one dimension of model quality.

Experiment 1: Code Generation Quality

Model Comparison Radar — Code Quality, Reasoning, Context, Speed, Cost

Goal: Test whether models produce functionally correct, idiomatic code for a moderately complex task.

The prompt (send to each model):

code
Write a Python function that:
- Takes a list of dictionaries representing transactions: [{"id": str, "amount": float, "category": str}]
- Groups transactions by category
- For each category, computes: count, total, average, max, min
- Returns a dict keyed by category name
- Handles an empty list by returning {}
- Uses only the Python standard library (no pandas)
- Has type hints and a docstring

What to look for:

| Criterion | Pass | Fail |
|---|---|---|
| Handles empty list | Returns {} without error | Raises KeyError or IndexError |
| Correct average | total / count | Division error on 0 |
| Type hints present | `list[dict]`, `dict[str, dict]` or similar | Missing or incorrect |
| Idiomatic grouping | `defaultdict` or `dict.setdefault` | Nested if chains |
| Docstring present | Yes | No |

Expected result: All frontier models should pass every criterion. Compare code elegance and whether they use defaultdict (idiomatic) versus repetitive if key not in result patterns.
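
For grading it helps to have a reference answer in hand. One idiomatic solution to the Experiment 1 prompt — a sketch, not the only acceptable shape:

```python
from collections import defaultdict


def summarize_transactions(
    transactions: list[dict],
) -> dict[str, dict[str, float]]:
    """Group transactions by category and compute summary statistics.

    Args:
        transactions: Items shaped like
            {"id": str, "amount": float, "category": str}.

    Returns:
        {category: {"count", "total", "average", "max", "min"}},
        or {} for an empty input list.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for tx in transactions:
        groups[tx["category"]].append(tx["amount"])
    return {
        category: {
            "count": len(amounts),
            "total": sum(amounts),
            "average": sum(amounts) / len(amounts),
            "max": max(amounts),
            "min": min(amounts),
        }
        for category, amounts in groups.items()
    }


txs = [
    {"id": "t1", "amount": 10.0, "category": "food"},
    {"id": "t2", "amount": 30.0, "category": "food"},
    {"id": "t3", "amount": 5.0, "category": "travel"},
]
print(summarize_transactions(txs)["food"])
# {'count': 2, 'total': 40.0, 'average': 20.0, 'max': 30.0, 'min': 10.0}
```

Note how the empty-list case falls out for free: an empty input leaves `groups` empty, so the comprehension returns {} with no special handling — a detail worth checking in each model's output.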

Experiment 2: Long-Context Faithfulness

Goal: Test whether a model can accurately answer questions about content deep in a long context.

Setup:

  1. Take any moderately large Python project you have (or use a popular open-source project)
  2. Concatenate 20-40 Python files into a single string (preserve filenames as headers)
  3. Insert a "needle" — a deliberately unusual comment in one of the middle files:
    python
    # AUDIT_MARKER: This function was reviewed by Sarah Chen on 2024-03-15
  4. Ask the model: "Find every function in this codebase that has been marked with an AUDIT_MARKER comment. List the function name, file, and reviewer."

What to look for:

  • Does the model find the marker even when it appears 50,000 tokens into the context?
  • Does it confidently say "no markers found" when you gave it zero markers? (hallucination test)
  • Does it extract the exact information (name, file, reviewer) or paraphrase?

Expected winners: Claude and Gemini — both have documented high scores on Needle in a Haystack benchmarks at long context lengths. GPT-5 is strong at 128K. DeepSeek R1 may struggle past 64K.
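
Building the haystack can itself be scripted, which makes the needle depth repeatable across models. A minimal sketch using synthetic files in place of a real project (the marker text follows the lesson's example; file contents are illustrative):

```python
def build_haystack(files: list[tuple[str, str]], needle: str,
                   needle_file_index: int) -> str:
    """Concatenate files with headers, injecting the needle into one file."""
    parts = []
    for i, (name, content) in enumerate(files):
        body = f"{needle}\n{content}" if i == needle_file_index else content
        parts.append(f"# ==== {name} ====\n{body}")
    return "\n\n".join(parts)


# Synthetic stand-ins for real project files
files = [(f"module_{i}.py", f"def f_{i}():\n    return {i}") for i in range(40)]
needle = "# AUDIT_MARKER: This function was reviewed by Sarah Chen on 2024-03-15"
haystack = build_haystack(files, needle, needle_file_index=20)

depth = haystack.index(needle) / len(haystack)
print(f"needle at ~{depth:.0%} depth")  # roughly the middle of the context
```

Varying `needle_file_index` lets you plot retrieval accuracy against depth — the same sweep the published Needle in a Haystack benchmarks perform.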

Experiment 3: Instruction Following — JSON Schema Adherence

Goal: Test whether a model returns EXACTLY what you asked for and nothing else.

The prompt:

code
Analyze this Python function and return ONLY a JSON object. Do not include
any explanation, markdown code blocks, or additional text. The JSON must
have exactly these fields and no others:
{
  "has_type_hints": boolean,
  "cyclomatic_complexity": integer (1-10 scale estimate),
  "has_docstring": boolean,
  "potential_bugs": array of strings
}

Function to analyze:
def calculate_discount(price, discount_pct, min_price=0):
    discounted = price - (price * discount_pct / 100)
    if discounted < min_price:
        return min_price
    return discounted

What to look for (ordered by failure severity):

  1. Critical fail: Returns non-JSON text ("Sure! Here is the analysis...")
  2. Major fail: Wraps the JSON in markdown code fences (```json ... ```)
  3. Minor fail: Adds extra fields not in the spec ("suggestions", "overall_score")
  4. Pass: Returns only the bare JSON object with exactly those four fields

Expected behavior by model:

  • Claude: Very high probability of returning bare JSON with exactly the specified fields
  • GPT-5: High probability, but may add markdown fences
  • o4-mini: High probability, but may add a brief preamble
  • Gemini: Usually good, occasionally adds "note" or "analysis" fields
  • DeepSeek R1: Good, but small chance of including Chinese analysis text

Teaching moment: This experiment reveals why system prompts in production apps say things like "You MUST respond with ONLY valid JSON. No markdown. No explanation. No preamble. If you cannot answer, return {\"error\": \"reason\"}." — because even top models need explicit reinforcement.
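
In production you rarely trust the model to pass this test; you parse defensively instead. A sketch of a tolerant parser that strips fences and preambles, then enforces the exact field set (field names follow the experiment's schema; the error handling is one reasonable choice, not a standard):

```python
import json
import re

EXPECTED_FIELDS = {"has_type_hints", "cyclomatic_complexity",
                   "has_docstring", "potential_bugs"}


def parse_model_json(raw: str) -> dict:
    """Extract the first {...} span, parse it, and enforce the schema."""
    # Tolerates markdown fences and chatty preambles around the object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(match.group(0))
    if set(obj) != EXPECTED_FIELDS:
        raise ValueError(f"field mismatch: {sorted(set(obj) ^ EXPECTED_FIELDS)}")
    return obj


# A deliberately messy response: preamble plus a fenced code block
messy = ('Sure! Here is the analysis:\n'
         '```json\n'
         '{"has_type_hints": false, "cyclomatic_complexity": 2,\n'
         ' "has_docstring": false, "potential_bugs": ["no input validation"]}\n'
         '```')
print(parse_model_json(messy)["cyclomatic_complexity"])  # 2
```

Many providers now also offer structured-output or JSON modes that constrain generation directly; a defensive parser like this remains a useful last line of defense either way.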

Experiment 4: Reasoning Chains — Complex Debugging

Goal: Diagnose why a subtle bug exists and propose the correct fix.

The buggy code:

python
import threading
from collections import defaultdict
 
class RequestCounter:
    def __init__(self):
        self.counts = defaultdict(int)
 
    def increment(self, endpoint: str):
        self.counts[endpoint] += 1
 
    def get_total(self) -> int:
        return sum(self.counts.values())
 
counter = RequestCounter()
 
def handle_request(endpoint: str):
    counter.increment(endpoint)
 
# Simulate 1000 concurrent requests
threads = [threading.Thread(target=handle_request, args=("/api/data",))
           for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
 
print(counter.get_total())  # Expect 1000, sometimes get less

The prompt:

code
This code sometimes prints a number less than 1000.
Identify the root cause and provide the corrected code.

The correct answer: The += operation on defaultdict is not atomic. It involves a read, increment, and write — three separate operations. With multiple threads, a race condition occurs: two threads can read the same value, both increment it, and write back the same incremented value, losing one count. Fix: use threading.Lock() around the increment operation, or replace with threading.local() plus aggregation, or use collections.Counter with a lock.
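The lock-based fix described above looks like this (a minimal sketch — only the locking is new relative to the original code):

```python
# Thread-safe version: a Lock makes the read-modify-write atomic.
import threading
from collections import defaultdict

class RequestCounter:
    def __init__(self):
        self.counts = defaultdict(int)
        self._lock = threading.Lock()   # guards every access to self.counts

    def increment(self, endpoint: str):
        with self._lock:                # += is read + add + write; serialize it
            self.counts[endpoint] += 1

    def get_total(self) -> int:
        with self._lock:
            return sum(self.counts.values())

counter = RequestCounter()
threads = [threading.Thread(target=counter.increment, args=("/api/data",))
           for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.get_total())  # Always 1000 now
```

A good model answer should arrive at something equivalent and explain why the lock closes the race window.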

What to look for:

  • Does the model identify "race condition" and "non-atomic operation"?
  • Does it correctly explain the read-modify-write sequence?
  • Does it propose threading.Lock() as the solution?
  • Does it explain WHY the fix works, not just provide code?

Expected winner: o4-mini and Claude Opus 4 tend to produce the clearest reasoning chains here. GPT-5 usually gets it right but may be less precise about the atomicity explanation. Gemini is also strong. DeepSeek R1 is very good at this class of problem.

Comprehensive Model Comparison Table

| Dimension | Claude Opus 4 | GPT-5 | o4-mini | Gemini 2.5 Pro | DeepSeek R1 | Qwen3 32B |
|---|---|---|---|---|---|---|
| Context window | 200K | 128K | 128K | 1M | 128K | 128K |
| Instruction following | Excellent | Very Good | Good | Good | Good | Good |
| Code generation | Excellent | Excellent | Very Good | Very Good | Very Good | Good |
| Reasoning / algorithms | Excellent | Very Good | Excellent | Very Good | Excellent | Very Good |
| Long-context faithfulness | Excellent | Very Good | Very Good | Excellent | Good | Good |
| JSON/format adherence | Excellent | Very Good | Good | Good | Good | Good |
| Multimodal (vision) | No* | Yes (GPT-4o) | No | Yes | No | Yes |
| Self-hosted | No | No | No | No | Yes | Yes |
| Approx. input price | $15/M | $2.50/M | $1.10/M | $1.25/M | $0.55/M | Free (local) |
| Best for | Large refactors, precise formatting | Rapid prototyping, general work | Hard reasoning, low cost | Huge context, multi-language | Air-gapped, reasoning | Privacy, local dev |

*Claude supports vision via separate API features as of 2025.
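The price row becomes concrete with a back-of-envelope calculation. This sketch uses the table's input prices (per 1M tokens; they change often, so recheck the pricing pages) to cost a single 10K-token prompt per provider:

```python
# Cost of one 10K-token prompt at the table's approximate input prices.
# Prices are per 1M input tokens and change frequently — verify before relying on them.
PRICE_PER_M = {
    "Claude Opus 4": 15.00,
    "GPT-5": 2.50,
    "o4-mini": 1.10,
    "Gemini 2.5 Pro": 1.25,
    "DeepSeek R1": 0.55,
}

def input_cost(model: str, tokens: int) -> float:
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return PRICE_PER_M[model] * tokens / 1_000_000

for model in PRICE_PER_M:
    print(f"{model:16s} ${input_cost(model, 10_000):.4f}")
# e.g. Claude Opus 4 costs $0.1500 per 10K-token prompt; DeepSeek R1 costs $0.0055
```

A 27x spread per prompt compounds quickly at production volume, which is why tiered model routing (flagship for hard tasks, cheap model for easy ones) is standard practice.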

Key Takeaways

  • No model is universally best. The right model depends on your task, context requirements, budget, and privacy constraints.
  • Claude leads on instruction following and multi-file architectural work. Use it when correctness and precision matter.
  • GPT-5 leads on developer experience and broad framework knowledge. Use it for fast prototyping and general coding.
  • o4-mini is the best value for reasoning-intensive tasks — nearly o3 quality at 10% the cost.
  • Gemini 2.5 Pro is unmatched for massive-context tasks. If your prompt won't fit anywhere else, send it to Gemini.
  • Open source models (DeepSeek R1, Qwen3, Llama 4) have reached production quality and are essential for privacy-sensitive and cost-sensitive workloads.
  • Model "personalities" are real — Claude is precise, GPT-5 is pragmatic, Gemini is thorough, DeepSeek R1 is terse. Use this to your advantage.

Common Mistakes to Avoid

  • Assuming benchmark scores = real-world performance. Benchmark scores are a useful signal, but models are often tuned toward the benchmarks themselves. Always run your own representative tasks.
  • Using the flagship model for everything. Opus 4 costs 10x more than Sonnet 4. For simple tasks, the cheaper model is often equivalent.
  • Sending the same prompt to every model. Different models respond to different prompt styles. What works for Claude may underperform on GPT-5. Tune your prompts per model.
  • Ignoring context window costs. Loading 150K tokens into every request adds up fast. Be strategic about what you include in context.
  • Treating open-source as "worse." For many developer tasks, Qwen3 32B or DeepSeek R1 14B running locally produces output indistinguishable from paid APIs.
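The context-window mistake above is easy to quantify. This sketch shows what resending a 150K-token context on every request costs over a month, assuming $2.50 per 1M input tokens (the table's GPT-5 input price — substitute your own rate):

```python
# Cumulative cost of resending a large context on every request.
# Assumes $2.50 per 1M input tokens; adjust PRICE_PER_TOKEN for your provider.
PRICE_PER_TOKEN = 2.50 / 1_000_000

def monthly_cost(context_tokens: int, requests_per_day: int, days: int = 30) -> float:
    """Input-token cost of resending the same context on every request."""
    return context_tokens * PRICE_PER_TOKEN * requests_per_day * days

print(f"${monthly_cost(150_000, 100):,.2f}/month")  # 150K tokens, 100 req/day
print(f"${monthly_cost(5_000, 100):,.2f}/month")    # trimmed 5K-token context
```

At 100 requests a day, trimming the context from 150K to 5K tokens cuts the input bill from $1,125 to $37.50 a month, before output tokens are even counted.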

Homework / Self-Study

  1. Run Experiment 3 yourself. Pick any two models you have API access to. Run the exact JSON instruction-following prompt. Document whether they add markdown fences, extra fields, or preamble text. Bring your findings to the next class.

  2. Install Ollama. Follow the setup instructions in Part 5. Pull qwen3:8b (small enough for most laptops). Ask it to refactor a function from your own codebase. Note the quality difference vs. any cloud model you have access to.

  3. Read the pricing pages for Anthropic, OpenAI, and Google AI Developer. Write down the current price per 1M output tokens for each frontier model. Prices change frequently — this exercise builds the habit of checking.
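For homework item 2, once `ollama serve` is running locally, you can talk to it programmatically. This sketch builds a request body for Ollama's `/api/generate` endpoint (the endpoint, default port, and `model`/`prompt`/`stream` fields are Ollama's documented API; the refactoring prompt itself is just an example):

```python
# Build a request for a locally running Ollama server (default port 11434).
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_refactor_request(model: str, source_code: str) -> dict:
    """Assemble a /api/generate request body asking for a refactor."""
    prompt = ("Refactor the following Python function. Add type hints "
              "and a Google-style docstring.\n\n" + source_code)
    return {"model": model, "prompt": prompt, "stream": False}

body = build_refactor_request("qwen3:8b", "def add(a, b):\n    return a + b")
print(json.dumps(body)[:60])
# Send with e.g. requests.post(OLLAMA_URL, json=body) once the server is up.
```

Because the model runs on your machine, you can iterate on prompts without per-token cost — useful for comparing local output quality against a cloud model on the same function.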

Checkpoint: Model Selection Rationale

Your task: Write a 1-paragraph model selection rationale for your project.

You are building [describe your project or choose one of these: a personal finance tracker web app / an internal code review bot / a local document Q&A tool for a law firm / a data pipeline for processing medical records].

Your paragraph must answer:

  1. Which model (or combination of models) will you use?
  2. Why do this model's strengths align with your specific use case?
  3. What are the top 2 constraints driving your choice (context window, cost, privacy, reasoning depth, instruction following, multimodality)?
  4. Which model did you explicitly choose NOT to use, and why?

Example rationale (for a code review bot):

"For our internal code review bot, I would use Claude Sonnet 4 as the primary model with Claude Haiku 3.5 as a fallback for simple formatting checks. Sonnet 4's combination of precise instruction following and 200K token context allows it to load an entire pull request — including the diff, referenced files, and our company style guide — in a single prompt, which is critical for accurate cross-file analysis. The two primary constraints are instruction following precision (we need structured JSON output for every review comment) and cost at scale (we process 50+ PRs daily). I explicitly chose not to use Gemini 2.5 Pro despite its larger context window because our data cannot leave our internal network, and we cannot self-host Gemini; and I ruled out o4-mini because its slightly lower instruction adherence increases the risk of malformed JSON crashing our CI pipeline."

This rationale format — model choice, strength alignment, constraints, explicit rejections — is what professional engineers write in architecture decision records (ADRs) when adopting AI tooling.

Next Lesson Preview

In Lesson 4: Prompt Engineering for Developers, we will:

  • Write system prompts that reliably produce structured output
  • Learn the difference between zero-shot, few-shot, and chain-of-thought prompting
  • Build a prompt template library for the most common developer tasks
  • Understand why temperature, top-p, and max-tokens matter for code generation
