
Lesson 3: The Big Three Providers — Live Model Comparison

Course: AI-Powered Development (Dev Track) | Duration: 2 hours | Level: Beginner

Learning Objectives

By the end of this lesson, students will be able to:

  • Distinguish the key strengths, weaknesses, and pricing of Claude, GPT/o-series, and Gemini
  • Explain what "frontier model" means and why it matters for development work
  • Select the right model for a given coding task based on objective criteria
  • Run an open-source model locally using Ollama
  • Write a clear model selection rationale for their own project

Prerequisites

  • Lesson 1: What is an LLM? (understanding tokenization, context windows, temperature)
  • Lesson 2: Model Families and Versioning (understanding Opus/Sonnet/Haiku-style tiers)
  • An API key for at least one provider (Anthropic, OpenAI, or Google) is helpful but not required

Lesson Outline

AI Provider Landscape — Anthropic, OpenAI, Google, Open Source

Part 1: Cold Open — Same Prompt, 4 Models (15 min)

The Setup

The fastest way to understand model differences is not to read benchmark tables — it is to watch the models solve the same problem side by side. Before any theory, let's run an experiment.

The Prompt (sent verbatim to four models):

code
You are a senior Python engineer. Refactor the following function so that it:
1. Has proper type hints
2. Handles edge cases (empty input, None, non-integer items in the list)
3. Is well-documented with a Google-style docstring
4. Does not change the observable behavior for valid inputs
5. Returns a list, not a generator

Original code:
def process(data):
    return (x * 2 for x in data if x > 0)

Respond with ONLY the refactored code, no explanation.

Side-by-Side Outputs

Claude Opus 4 (Anthropic):

python
from typing import Optional
 
 
def process(data: Optional[list]) -> list:
    """Double all positive integers in a sequence.
 
    Args:
        data: A list of values to process. May be None or contain
              non-integer items, which are silently skipped.
 
    Returns:
        A new list containing each positive integer from `data`
        multiplied by 2, in input order. Returns an empty list if
        `data` is None, empty, or contains no positive integers.
 
    Examples:
        >>> process([1, -2, 3, "x", None])
        [2, 6]
        >>> process([])
        []
        >>> process(None)
        []
    """
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

GPT-5 (OpenAI):

python
from typing import Any
 
 
def process(data: list[Any] | None) -> list[int]:
    """Double all positive integers in the input list.
 
    Args:
        data: Input list. Non-integer items and non-positive integers
              are excluded. None is treated as an empty list.
 
    Returns:
        List of doubled positive integers.
 
    Examples:
        >>> process([1, 2, -1, "skip", None])
        [2, 4]
    """
    if data is None:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

Gemini 2.5 Pro (Google):

python
from typing import Any
 
 
def process(data: list[Any] | None = None) -> list[int]:
    """Doubles all positive integers found in the input data list.
 
    Iterates through the provided list, filters for items that are
    positive integers, doubles them, and returns the results as a list.
    Non-integer types and non-positive integers are silently ignored.
 
    Args:
        data: A list of items to process. Items that are not integers
              or are not positive will be ignored. Defaults to None.
 
    Returns:
        A list of integers, where each element is double the value
        of a positive integer found in `data`. Returns an empty list
        if `data` is None, empty, or contains no qualifying items.
 
    Examples:
        >>> process([1, 2, 3])
        [2, 4, 6]
        >>> process([1, -1, 0, 'hello', None, 2])
        [2, 4]
        >>> process([])
        []
        >>> process(None)
        []
    """
    if not data:
        return []
    return [item * 2 for item in data if isinstance(item, int) and item > 0]

DeepSeek R1 (DeepSeek):

python
def process(data: list | None) -> list[int]:
    """Process a list by doubling positive integers.
 
    Args:
        data: Input list, may contain any types or be None.
 
    Returns:
        List of positive integers from data each multiplied by 2.
        Empty list if data is None, empty, or has no positive ints.
    """
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]

What Do We Observe?

| Dimension | Claude Opus 4 | GPT-5 | Gemini 2.5 Pro | DeepSeek R1 |
|---|---|---|---|---|
| Instruction followed exactly? | Yes | Yes | Almost (added default arg) | Yes |
| Docstring quality | Excellent (with doctests) | Good | Verbose but thorough | Concise |
| Type hint style | `Optional[list]` (3.9 compat) | `list[Any] \| None` (3.10+) | `list[Any] \| None` (3.10+) | `list \| None` |
| Edge case: None | Handled | Handled | Handled + default arg | Handled |
| Edge case: non-integer | Handled | Handled | Handled | Handled |
| Token count (approx) | 280 | 190 | 320 | 150 |
| Unexpected additions | Doctests in docstring | None | Default parameter | None |

Key observations:

  • All four models produced functionally correct, production-quality code
  • Claude added executable doctests unprompted — useful but technically outside the spec
  • Gemini added = None as a default parameter — a reasonable interpretation but technically changes the function signature
  • DeepSeek R1 was the most terse and literal — least opinionated
  • GPT-5 used the most modern Python union syntax (|) by default
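
The first observation can be checked mechanically rather than by eye. A minimal harness sketch — two representative implementations are reproduced below (their bodies are identical precisely because the models converged on the same logic):

```python
# Representative variants from the side-by-side outputs above.
def process_claude(data):
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]


def process_deepseek(data):
    if not data:
        return []
    return [x * 2 for x in data if isinstance(x, int) and x > 0]


# Edge cases drawn from the prompt's requirements
CASES = [None, [], [1, -2, 3, "x", None], [0, 1, 2], ["a", "b"]]


def equivalent(*implementations):
    """Return True if every implementation agrees on every test case."""
    for case in CASES:
        results = {tuple(impl(case)) for impl in implementations}
        if len(results) != 1:
            return False
    return True


print(equivalent(process_claude, process_deepseek))  # True
```

The same harness extends naturally to all four outputs — paste each model's function in under a distinct name and add it to the call.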

Teacher's Note: Ask the class: "Which output would you submit to a code review? Which surprised you?" There is no wrong answer here — the goal is to build intuition about model "personalities."

Part 2: Anthropic — Claude Models (25 min)

The Claude Family (2025-2026)

Anthropic releases models in a tiered structure. As of early 2026, the primary models available are:

| Model | Context Window | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|
| Claude Opus 4 | 200K tokens | Complex reasoning, large codebase work | $15 / 1M tokens | $75 / 1M tokens |
| Claude Sonnet 4 | 200K tokens | Daily coding, balanced speed/quality | $3 / 1M tokens | $15 / 1M tokens |
| Claude Haiku 3.5 | 200K tokens | Fast, cheap, simple tasks | $0.80 / 1M tokens | $4 / 1M tokens |

Note: Prices listed are approximate as of Q1 2026. Always verify current pricing at console.anthropic.com/pricing.

What Makes Claude Different

1. Instruction following

Claude is widely regarded as the strongest model for precise, multi-step instruction following. In benchmarks and anecdotal developer experience, Claude is less likely to "drift" — to add unrequested content, change a constraint mid-response, or silently ignore a negative instruction ("do NOT include X").

This matters enormously in coding workflows where you write system prompts like: "Never add comments. Always return only the changed lines. Never rename variables."

2. Long-context faithfulness (200K tokens)

200,000 tokens is approximately 150,000 words — roughly 500 pages of code. Claude can:

  • Accept an entire medium-sized codebase in a single prompt
  • Answer questions about cross-file dependencies
  • Produce a multi-file diff with consistent variable naming across files
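
Loading a codebase into a single prompt is mostly careful string assembly. A minimal sketch, using an in-memory dict in place of real files (the filenames, contents, and question here are illustrative):

```python
def pack_files(files: dict[str, str], question: str) -> str:
    """Concatenate files into one prompt, with filename headers so the
    model can reference cross-file dependencies by path."""
    sections = [f"[FILE: {path}]\n{content}" for path, content in files.items()]
    return "\n\n".join(sections) + f"\n\nQuestion: {question}"


# Illustrative stand-ins for real source files
repo = {
    "utils.py": "def slugify(s):\n    return s.lower().replace(' ', '-')",
    "views.py": "from utils import slugify\n\ndef page(title):\n    return slugify(title)",
}

prompt = pack_files(repo, "Which files would break if slugify were renamed?")
print(prompt.splitlines()[0])  # [FILE: utils.py]
```

In practice you would read the files from disk and filter out generated artifacts (virtualenvs, build output) before packing, since every token counts against the window.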

Claude's context faithfulness (how well it uses content from early in a long prompt) is measured in the "Needle in a Haystack" benchmark. Claude consistently scores near the top.

3. Multi-file architectural planning

This is where Claude particularly shines. Given a complex refactoring task spanning 10+ files, Claude tends to:

  • Recognize implicit constraints (e.g., a function in utils.py is called from 4 other files — renaming it requires updating all 4)
  • Produce a reasoning trace before the code (if asked)
  • Maintain consistent design decisions across a long response

4. Constitutional AI background

Anthropic was founded with a focus on AI safety research. Claude's training includes techniques designed to make it more honest (less likely to hallucinate confidently), more helpful, and less harmful. For production coding work, this translates to Claude being more willing to say "I'm not sure" or "this might not be the best approach" rather than confidently generating buggy code.

Code Example: Multi-File Refactoring Task

Scenario: You have a Django app where the User model has a full_name property computed from first_name and last_name. A new requirement says you need to support a display_name field that can be set independently. You need Claude to identify all the places this will break and produce a migration plan.

Prompt to Claude Opus 4:

code
I have a Django app. Here are 4 files. I need to add a `display_name` field
to the User model that:
- Is optional (nullable)
- Falls back to `first_name + " " + last_name` if not set
- Must be updated in the admin, serializers, and all templates that use `full_name`

Analyze the impact and give me the complete list of changes needed, then
produce each changed file in full.

[FILE: models.py]
[FILE: admin.py]
[FILE: serializers.py]
[FILE: templates/profile.html]

What Claude produces (summary of output structure):

code
Impact Analysis:
1. models.py — Add `display_name = models.CharField(max_length=200, blank=True, null=True)`
   Update `full_name` property to use display_name if set.
   Add database migration.

2. admin.py — Add `display_name` to `fieldsets` and `list_display`.

3. serializers.py — Add `display_name` to fields list.
   Update `full_name` SerializerMethodField to call model property.

4. templates/profile.html — No change needed (uses `{{ user.full_name }}`
   which now goes through the updated property).

Migration risk: LOW — existing `full_name` callers all go through the property,
so no data migration required for existing records.

--- models.py (complete) ---
[full file content...]

--- admin.py (complete) ---
[full file content...]

--- serializers.py (complete) ---
[full file content...]

--- 0003_user_display_name.py (migration) ---
[full migration file content...]

Why this showcases Claude's strength: It identified that templates/profile.html did NOT need changing (because it called full_name which is a property — not the database field directly). A weaker model would have either missed the template entirely or incorrectly modified it.

When to Use Claude

  • Large codebase navigation (loading 50+ files into context)
  • Complex, multi-step refactors requiring architectural awareness
  • Tasks with strict output format requirements ("return ONLY valid JSON")
  • Security-sensitive code review (Claude tends to flag issues proactively)
  • Writing technical documentation from code

When NOT to Use Claude

  • Rapid fire, cheap API calls at scale (cost can add up — use Haiku or Sonnet)
  • Real-time streaming chat UIs (latency for Opus is higher)
  • Tasks requiring internet access (Claude cannot browse the web natively)

Part 3: OpenAI — GPT and o-series Models (25 min)

The OpenAI Model Lineup (2025-2026)

OpenAI maintains two distinct model families with different reasoning approaches:

| Model | Context Window | Reasoning Type | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|---|
| GPT-5 | 128K tokens | Standard | General coding, chat, prototyping | $2.50 / 1M tokens | $10 / 1M tokens |
| o3 | 200K tokens | Extended thinking | Math, algorithms, hard logic | $10 / 1M tokens | $40 / 1M tokens |
| o4-mini | 128K tokens | Extended thinking (fast) | Coding + reasoning at low cost | $1.10 / 1M tokens | $4.40 / 1M tokens |
| GPT-4o | 128K tokens | Standard (multimodal) | Vision tasks, existing integrations | $2.50 / 1M tokens | $10 / 1M tokens |

Note: Prices are approximate as of Q1 2026. Verify at platform.openai.com/docs/pricing.

The GPT Family — Standard Models

GPT-5 is the flagship general-purpose model. Its key characteristics for development work:

  • Strong code generation: GPT-5 has been trained on an enormous volume of open-source code and is very effective at generating boilerplate, implementing standard patterns, and producing working code quickly
  • Strong ecosystem knowledge: Knows popular libraries, frameworks, and APIs extremely well (React, Django, FastAPI, pandas, etc.)
  • Good "developer experience" intuition: Often produces code that feels idiomatic to experienced developers — it has absorbed community conventions
  • Slightly more permissive: Will attempt tasks even when the requirements are ambiguous. This is sometimes useful, sometimes leads to incorrect assumptions

The o-series — Extended Thinking Models

The o3 and o4-mini models represent a fundamentally different approach. Before producing output, they run an internal "thinking" phase (sometimes called chain-of-thought or scratchpad reasoning) that is NOT shown to the user but influences the final answer.

What this means in practice:

  • Response latency is higher (seconds longer than GPT-5)
  • Quality on multi-step logical problems is significantly better
  • Mathematical reasoning, algorithm design, and complex debugging all improve
  • Overkill for simple tasks

o4-mini is the sweet spot for most coding workflows — it has nearly o3-level reasoning at roughly 10% of the cost.

code
GPT-5:   Fast + broad knowledge = good for most tasks
o4-mini: Slower + deeper reasoning = good for hard problems at low cost
o3:      Slowest + deepest reasoning = good for the hardest problems
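
That tiering can be encoded as a simple routing helper. A hypothetical sketch — the complexity heuristic and thresholds are illustrative, not an official OpenAI recommendation:

```python
def pick_openai_model(task_complexity: int, budget_sensitive: bool = True) -> str:
    """Pick a model tier from a rough 1-10 complexity estimate.

    Heuristic only: routine tasks go to the standard model, mid-range
    reasoning to o4-mini, and the hardest problems to o3 when budget allows.
    """
    if task_complexity <= 4:
        return "gpt-5"
    if task_complexity <= 8 or budget_sensitive:
        return "o4-mini"
    return "o3"


print(pick_openai_model(3))                            # gpt-5
print(pick_openai_model(7))                            # o4-mini
print(pick_openai_model(10, budget_sensitive=False))   # o3
```

Teams often wire a helper like this into their tooling so that cheap calls are the default and expensive reasoning models must be opted into.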

Code Example: Rapid Prototyping Task

Scenario: You need a FastAPI endpoint that accepts a CSV upload, validates that it has the required columns, and returns summary statistics. You need it working in 5 minutes.

Prompt to GPT-5:

code
Create a FastAPI endpoint POST /upload-csv that:
- Accepts a CSV file upload
- Validates it contains columns: user_id, amount, date
- Returns JSON with: row_count, total_amount, date_range (min/max date)
- Returns 400 with a clear error message if columns are missing

Use pandas for the CSV processing. Keep it concise.

GPT-5 Output:

python
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
import pandas as pd
import io
 
app = FastAPI()
 
REQUIRED_COLUMNS = {"user_id", "amount", "date"}
 
 
@app.post("/upload-csv")
async def upload_csv(file: UploadFile = File(...)):
    contents = await file.read()
    try:
        df = pd.read_csv(io.BytesIO(contents))
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid CSV: {e}")
 
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise HTTPException(
            status_code=400,
            detail=f"Missing required columns: {sorted(missing)}"
        )
 
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
 
    return JSONResponse({
        "row_count": len(df),
        "total_amount": round(df["amount"].sum(), 2),
        "date_range": {
            "min": str(df["date"].min().date()),
            "max": str(df["date"].max().date())
        }
    })

Why GPT-5 excels here: The output is idiomatic, complete, and runnable on the first try with no modifications. It correctly handled the async pattern for FastAPI file uploads, added type coercion for robustness, and formatted the response as a developer would expect. Time to working prototype: under 2 minutes.
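
The same validation-and-summary logic can be expressed with only the standard library — a useful variant (not part of the GPT-5 output above) for environments where pandas is unavailable. Note this sketch assumes ISO-formatted dates, which sort correctly as strings:

```python
import csv
import io

REQUIRED_COLUMNS = {"user_id", "amount", "date"}


def summarize_csv(text: str) -> dict:
    """Validate required columns and compute summary stats from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = list(reader)
    amounts = [float(r["amount"]) for r in rows]
    dates = [r["date"] for r in rows]  # ISO dates sort lexicographically
    return {
        "row_count": len(rows),
        "total_amount": round(sum(amounts), 2),
        "date_range": {"min": min(dates), "max": max(dates)},
    }


sample = "user_id,amount,date\nu1,10.5,2024-01-02\nu2,4.5,2024-01-01\n"
print(summarize_csv(sample))
# {'row_count': 2, 'total_amount': 15.0, 'date_range': {'min': '2024-01-01', 'max': '2024-01-02'}}
```

Comparing this against the pandas version is itself a good exercise: pandas buys you robust type coercion (`errors="coerce"`) that the stdlib version would need to implement by hand.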

Code Example: o4-mini for Hard Algorithmic Reasoning

Prompt to o4-mini:

code
I have a list of tasks, each with a start time, end time, and profit value.
I can only run one task at a time. Find the maximum profit I can earn.
This is the weighted job scheduling problem.
Write a Python solution that is O(n log n). Explain the approach first,
then give the code.

o4-mini Output (reasoning phase hidden, final answer):

code
Approach:
Sort jobs by end time. Use binary search to find, for each job,
the last job that doesn't overlap with it. Build a DP table where
dp[i] = max profit using jobs 0..i. For each job i:
  dp[i] = max(dp[i-1], jobs[i].profit + dp[last_non_overlapping(i)])

Time: O(n log n) for sort + O(n log n) for n binary searches = O(n log n)
Space: O(n) for DP array
python
import bisect
from dataclasses import dataclass
 
 
@dataclass
class Job:
    start: int
    end: int
    profit: int
 
 
def max_profit(jobs: list[Job]) -> int:
    """Find maximum profit from non-overlapping jobs (weighted job scheduling).
 
    Args:
        jobs: List of Job objects with start, end, and profit fields.
 
    Returns:
        Maximum achievable profit from a non-overlapping subset of jobs.
 
    Time complexity: O(n log n)
    Space complexity: O(n)
    """
    if not jobs:
        return 0
 
    jobs.sort(key=lambda j: j.end)
    end_times = [j.end for j in jobs]
    n = len(jobs)
    dp = [0] * (n + 1)  # dp[i] = max profit using first i jobs
 
    for i in range(1, n + 1):
        job = jobs[i - 1]
        # Find last job that ends <= current job's start time
        k = bisect.bisect_right(end_times, job.start, 0, i - 1)
        dp[i] = max(dp[i - 1], job.profit + dp[k])
 
    return dp[n]
 
 
# Example
jobs = [
    Job(start=1, end=3, profit=50),
    Job(start=3, end=5, profit=20),
    Job(start=6, end=19, profit=100),
    Job(start=2, end=100, profit=200),
]
print(max_profit(jobs))  # 200 (job 4 alone beats jobs 1+2+3: 200 > 50 + 20 + 100 = 170)

Why o4-mini excels here: The solution is provably correct, uses the right algorithm (not a greedy approximation), and the extended thinking phase eliminated common mistakes like off-by-one errors in the binary search index.
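
A good habit with model-generated algorithms is to cross-check them against an obviously-correct brute force on small random inputs. A sketch reusing the Job and max_profit definitions from the answer above (the brute force is exponential, so it is only viable at small n):

```python
import bisect
import itertools
import random
from dataclasses import dataclass


@dataclass
class Job:
    start: int
    end: int
    profit: int


def max_profit(jobs: list[Job]) -> int:
    """O(n log n) DP solution, as in the o4-mini answer above."""
    if not jobs:
        return 0
    jobs = sorted(jobs, key=lambda j: j.end)
    end_times = [j.end for j in jobs]
    n = len(jobs)
    dp = [0] * (n + 1)
    for i in range(1, n + 1):
        job = jobs[i - 1]
        k = bisect.bisect_right(end_times, job.start, 0, i - 1)
        dp[i] = max(dp[i - 1], job.profit + dp[k])
    return dp[n]


def brute_force(jobs: list[Job]) -> int:
    """Exponential check: best profit over all pairwise-compatible subsets."""
    best = 0
    for r in range(len(jobs) + 1):
        for subset in itertools.combinations(jobs, r):
            if all(a.end <= b.start or b.end <= a.start
                   for a, b in itertools.combinations(subset, 2)):
                best = max(best, sum(j.profit for j in subset))
    return best


random.seed(0)
for _ in range(100):
    jobs = [Job(s := random.randint(0, 20), s + random.randint(1, 10),
                random.randint(1, 100)) for _ in range(6)]
    assert max_profit(jobs) == brute_force(jobs)
print("DP matches brute force on 100 random instances")
```

If the assertion ever fails, you have a minimal counterexample to paste back into the model — a far stronger debugging prompt than "this looks wrong."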

When to Use OpenAI Models

  • GPT-5: Rapid prototyping, boilerplate generation, tasks needing broad framework knowledge, existing integrations using the OpenAI SDK
  • o4-mini: Algorithm design, debugging complex logic, math-heavy code, cost-effective reasoning tasks
  • o3: The hardest problems you face — when correctness matters more than speed and cost

When NOT to Use OpenAI Models

  • Massive context requirements (Claude and Gemini have larger windows)
  • Strict, unusual formatting requirements (o-series in particular can be verbose)
  • Privacy-sensitive on-premises work (use open-source alternatives)

Part 4: Google — Gemini Models (20 min)

The Gemini Lineup (2025-2026)

| Model | Context Window | Key Feature | Best Use | Approx. Input Price | Approx. Output Price |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | 1M tokens | Largest context, multimodal | Massive codebase analysis, video/image + code | $1.25 / 1M tokens (up to 200K) | $10 / 1M tokens |
| Gemini 2.5 Flash | 1M tokens | Fast, cheap, large context | High-volume tasks, long-doc processing | $0.075 / 1M tokens | $0.30 / 1M tokens |

Note: Gemini 2.5 Pro pricing has a tiered structure — prompts over 200K tokens are charged at a higher rate. Verify current pricing at ai.google.dev/pricing.

What Makes Gemini Different

1. The 1 Million Token Context Window

One million tokens is approximately 750,000 words — roughly 2,500 pages of code or a complete large monorepo. In practical terms this means:

  • Load an entire 200-file Python project into a single prompt
  • Include all your test files, documentation, and configuration
  • Ask cross-cutting questions: "Which functions are called from more than 3 files?" or "Find all places we're not handling the database connection timeout"

2. Native Multimodality

Gemini was designed from the ground up to handle text, images, video, and audio in the same prompt. For developers this means:

  • Send a screenshot of a UI bug and ask "write the CSS to fix this"
  • Send an architecture diagram image and ask "generate the database schema for this ERD"
  • Send a video walkthrough of a bug and ask "what is causing this behavior?"

3. Speed at Scale

Gemini 2.5 Flash is exceptionally fast and cheap, making it suitable for high-volume automated pipelines — code review bots, CI comment generators, documentation scanners.

Code Example: Multi-Language Repository Analysis

Scenario: You have a repository with a Python FastAPI backend, a TypeScript React frontend, and a Go microservice for image processing. You want to understand all the API contracts and find any mismatches.

Prompt to Gemini 2.5 Pro:

code
I'm providing a multi-language repository: Python FastAPI backend, TypeScript
React frontend, and Go image processing service.

1. Extract all API endpoints defined in the backend (path, method, request schema, response schema)
2. Extract all API calls made from the frontend (URL pattern, expected response shape)
3. Extract all HTTP endpoints exposed by the Go service
4. Identify any mismatches between what the frontend expects and what the backend provides
5. Output results as structured JSON

[Full contents of api/routes.py — 800 lines]
[Full contents of api/schemas.py — 400 lines]
[Full contents of src/api/client.ts — 600 lines]
[Full contents of src/types/api.ts — 300 lines]
[Full contents of imageservice/main.go — 700 lines]
[Full contents of imageservice/handlers.go — 500 lines]

Gemini 2.5 Pro Output structure:

json
{
  "backend_endpoints": [
    {
      "path": "/api/v1/users/{user_id}",
      "method": "GET",
      "response_schema": {
        "id": "int",
        "email": "str",
        "display_name": "str | null",
        "created_at": "datetime"
      }
    },
    ...
  ],
  "frontend_api_calls": [
    {
      "file": "src/api/client.ts",
      "line": 47,
      "url_pattern": "/api/v1/users/{userId}",
      "expected_fields": ["id", "email", "name", "createdAt"]
    },
    ...
  ],
  "mismatches": [
    {
      "severity": "HIGH",
      "description": "Frontend expects field 'name' but backend returns 'display_name'",
      "frontend_location": "src/api/client.ts:47",
      "backend_location": "api/schemas.py:23"
    },
    {
      "severity": "MEDIUM",
      "description": "Frontend expects camelCase 'createdAt', backend returns snake_case 'created_at'. Verify your serializer is applying camelCase conversion.",
      "frontend_location": "src/types/api.ts:12",
      "backend_location": "api/schemas.py:26"
    }
  ]
}

Why Gemini excels here: The 1M token window means the entire relevant codebase fits in a single prompt with room to spare. Finding cross-language API contract mismatches typically requires custom tooling or manual review — Gemini can do it in one shot, and the same approach keeps working as the repository grows to hundreds of files that would force other models to chunk the input.
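
The camelCase/snake_case mismatch in the report above is common enough to check mechanically as a baseline. A hypothetical sketch — the field lists are illustrative stand-ins for schemas extracted from real code:

```python
def to_camel(snake: str) -> str:
    """created_at -> createdAt"""
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)


def field_mismatches(backend_fields: set[str], frontend_fields: set[str]) -> dict:
    """Classify each frontend field as matched, case-mismatch, or missing."""
    camel_map = {to_camel(f): f for f in backend_fields}
    report = {"matched": [], "case_mismatch": [], "missing": []}
    for f in frontend_fields:
        if f in backend_fields:
            report["matched"].append(f)
        elif f in camel_map:
            report["case_mismatch"].append((f, camel_map[f]))
        else:
            report["missing"].append(f)
    return report


backend = {"id", "email", "display_name", "created_at"}
frontend = {"id", "email", "name", "createdAt"}
print(field_mismatches(backend, frontend))
```

A deterministic checker like this catches the naming-convention class of mismatch; the model's value-add is the harder cases, like "name" vs "display_name", where the mapping is semantic rather than mechanical.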

Gemini for Document-Heavy Developer Work

Beyond code, Gemini's large context shines for:

  • Dependency audits: Load all requirements.txt / package.json / go.mod files from a monorepo and find conflicting versions
  • Security scanning: Load all configuration files and find hardcoded credentials or unsafe defaults
  • Test coverage analysis: Load all test files and application code, ask "which functions have no corresponding test?"
  • Changelog generation: Load all commit messages from the last 6 months, produce a structured CHANGELOG
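
The dependency-audit idea can be prototyped without a model at all, which makes a useful baseline for grading the model's answer. A minimal sketch over in-memory requirements.txt contents (package names, versions, and paths are illustrative; only exact `==` pins are considered):

```python
from collections import defaultdict


def find_conflicts(requirement_files: dict[str, str]) -> dict[str, dict[str, str]]:
    """Map package -> {file: pinned_version} for packages pinned differently."""
    pins: dict[str, dict[str, str]] = defaultdict(dict)
    for path, text in requirement_files.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments, blanks, and non-exact pins
            name, version = line.split("==", 1)
            pins[name.strip()][path] = version.strip()
    return {pkg: files for pkg, files in pins.items()
            if len(set(files.values())) > 1}


monorepo = {
    "api/requirements.txt": "requests==2.31.0\npydantic==2.5.0",
    "worker/requirements.txt": "requests==2.28.0\ncelery==5.3.0",
}
print(find_conflicts(monorepo))
# {'requests': {'api/requirements.txt': '2.31.0', 'worker/requirements.txt': '2.28.0'}}
```

The model version of this task earns its keep on the messier inputs — range specifiers, package.json semver, go.mod — where a twenty-line parser stops being enough.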

When to Use Gemini Models

  • Gemini 2.5 Pro: Any task requiring more than 100K tokens of context, multimodal tasks (image + code), multi-language repository analysis
  • Gemini 2.5 Flash: High-volume automated pipelines, cheap long-document processing, CI/CD integrations where cost-per-call matters

When NOT to Use Gemini Models

  • Strict instruction following on unusual formats (Claude tends to be more reliable here)
  • Complex multi-step reasoning without retrieval (o3/o4-mini tends to be stronger)
  • Privacy-sensitive work where you cannot send data to Google's servers

Part 5: Open Source — DeepSeek, Llama, Qwen (15 min)

The Open-Source Landscape (2025-2026)

"Open source" in the LLM context means the model weights are publicly released, allowing you to download and run the model on your own hardware without sending data to any external server.

| Model | Parameters | Context | Strengths | License |
|---|---|---|---|---|
| DeepSeek R1 | 671B (MoE) | 128K | Strong reasoning, competitive with o1-class | MIT |
| DeepSeek V3 | 671B (MoE) | 128K | Strong coding, cheap API | MIT |
| Llama 4 Scout | 17B active | 10M | Massive context, efficient | Llama 4 Community |
| Llama 4 Maverick | 17B active | 1M | Multimodal, strong coding | Llama 4 Community |
| Qwen3 235B | 22B active | 128K | Multilingual, very strong coding | Apache 2.0 |
| Qwen3 32B | 32B | 128K | Local-friendly, strong reasoning | Apache 2.0 |

Note: "MoE" means Mixture of Experts — only a subset of parameters are active per token, making the model more efficient to run than the parameter count implies.

Why Open Source Matters for Developers

Scenario 1: Air-gapped environments

You work at a bank, hospital, or government agency where no data can leave the internal network. You cannot call Anthropic, OpenAI, or Google APIs. Open-source models are your only option for AI-assisted development.

Scenario 2: Cost at scale

Running 100 million input tokens per month through Claude Opus 4 costs roughly $1,500 at $15 per 1M tokens — before counting output tokens, which are five times more expensive. Running Qwen3 32B on a $10,000 GPU server pays for itself well within a year at that volume — and then you have permanent capacity.

Scenario 3: Fine-tuning

You want a model that specializes in your internal coding standards, your company's DSL, or your specific tech stack. Fine-tuning requires access to the weights — which only open-source models provide.

Scenario 4: Privacy and compliance

GDPR, HIPAA, and other regulations may prohibit sending certain data to third-party API providers. Self-hosted models solve this.

How to Run Models Locally with Ollama

Ollama is the simplest way to run open-source LLMs on your laptop. It handles model downloads, quantization, and serving a local API that is compatible with the OpenAI SDK.

Installation (macOS/Linux):

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull a model (Qwen3 14B — good balance of quality and size)
ollama pull qwen3:14b
 
# Pull DeepSeek R1 distill (smaller, faster version)
ollama pull deepseek-r1:14b
 
# Pull Llama 4 Scout
ollama pull llama4:scout

Run a model interactively:

bash
ollama run qwen3:14b

Use Ollama's API (compatible with OpenAI SDK):

python
# Ollama serves an OpenAI-compatible API on localhost:11434
# You can use the openai library with no changes except the base_url
 
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK but not used
)
 
response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "user", "content": "Refactor this function with type hints: def add(a, b): return a + b"}
    ]
)
 
print(response.choices[0].message.content)

Hardware requirements (approximate):

| Model | RAM Required | Recommended GPU VRAM |
|---|---|---|
| Qwen3 8B (Q4 quantized) | 8 GB | 6 GB |
| Qwen3 14B (Q4 quantized) | 10 GB | 8 GB |
| Qwen3 32B (Q4 quantized) | 20 GB | 16 GB |
| DeepSeek R1 14B (Q4) | 10 GB | 8 GB |
| Llama 4 Scout (Q4) | 12 GB | 10 GB |

Q4 quantized = the model weights are compressed from 16-bit floats to roughly 4 bits per weight. This reduces memory by ~4x with modest quality loss — acceptable for most coding tasks.
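
The RAM figures above follow from simple arithmetic: bytes ≈ parameter count × bits per weight / 8, plus runtime overhead. A back-of-envelope sketch (it ignores KV cache and activation memory, which is why the table's figures run slightly higher):

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billions * bits_per_weight / 8


for model, params in [("Qwen3 8B", 8), ("Qwen3 14B", 14), ("Qwen3 32B", 32)]:
    print(f"{model}: ~{weight_gb(params, 4):.0f} GB at Q4 "
          f"(vs ~{weight_gb(params, 16):.0f} GB at FP16)")
```

The same formula explains why MoE models are listed by active parameters: only the active experts need to be resident per token, though the full weights must still fit somewhere.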

DeepSeek R1 — The Reasoning Model

DeepSeek R1 deserves special attention because it is the first open-source model to match closed-source o1-class reasoning performance. Released in January 2025 by the Chinese AI lab DeepSeek, it shocked the industry with performance competitive with o1 at a fraction of the training cost.

  • Strengths: Mathematics, algorithmic reasoning, complex debugging
  • Weaknesses: English instruction following is slightly less polished than Claude/GPT-5; responses can occasionally mix in Chinese text
  • Key insight: The DeepSeek R1 "distill" variants (1.5B to 70B) are smaller open models fine-tuned on R1's reasoning traces; they run on consumer hardware while retaining much of the reasoning improvement

When to Use Open Source Models

  • Air-gapped or privacy-sensitive environments
  • High-volume tasks where API costs are prohibitive
  • Fine-tuning for domain-specific needs
  • Experimentation and research (freedom to inspect internals)
  • When you need permanent, offline capability

When NOT to Use Open Source Models

  • You need frontier-level quality and don't have the hardware
  • Setup time and maintenance overhead are not acceptable
  • Your team has no DevOps capacity to host and maintain a model server

Part 6: Live Experiments (20 min)

These four experiments are designed to be run live during class (or as take-home exercises if time is short). Each experiment isolates one dimension of model quality.

Experiment 1: Code Generation Quality

Model Comparison Radar — Code Quality, Reasoning, Context, Speed, Cost

Goal: Test whether models produce functionally correct, idiomatic code for a moderately complex task.

The prompt (send to each model):

code
Write a Python function that:
- Takes a list of dictionaries representing transactions: [{"id": str, "amount": float, "category": str}]
- Groups transactions by category
- For each category, computes: count, total, average, max, min
- Returns a dict keyed by category name
- Handles an empty list by returning {}
- Uses only the Python standard library (no pandas)
- Has type hints and a docstring

What to look for:

| Criterion | Pass | Fail |
|---|---|---|
| Handles empty list | Returns {} without error | Raises KeyError or IndexError |
| Correct average | total / count | Division error on 0 |
| Type hints present | `list[dict]`, `dict[str, dict]` or similar | Missing or incorrect |
| Idiomatic grouping | `defaultdict` or `dict.setdefault` | Nested if chains |
| Docstring present | Yes | No |

Expected result: All frontier models should pass every criterion. Compare code elegance and whether they use defaultdict (idiomatic) versus repetitive if key not in result patterns.
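
For grading it helps to have a reference answer in hand. One idiomatic solution to the Experiment 1 prompt — a sketch, not the only acceptable shape:

```python
from collections import defaultdict


def summarize_transactions(
    transactions: list[dict],
) -> dict[str, dict[str, float]]:
    """Group transactions by category and compute summary statistics.

    Args:
        transactions: Items shaped like
            {"id": str, "amount": float, "category": str}.

    Returns:
        {category: {"count", "total", "average", "max", "min"}},
        or {} for an empty input list.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for tx in transactions:
        groups[tx["category"]].append(tx["amount"])
    return {
        category: {
            "count": len(amounts),
            "total": sum(amounts),
            "average": sum(amounts) / len(amounts),
            "max": max(amounts),
            "min": min(amounts),
        }
        for category, amounts in groups.items()
    }


txs = [
    {"id": "t1", "amount": 10.0, "category": "food"},
    {"id": "t2", "amount": 30.0, "category": "food"},
    {"id": "t3", "amount": 5.0, "category": "travel"},
]
print(summarize_transactions(txs)["food"])
# {'count': 2, 'total': 40.0, 'average': 20.0, 'max': 30.0, 'min': 10.0}
```

Note how the empty-list case falls out for free: an empty input leaves `groups` empty, so the comprehension returns {} with no special handling — a detail worth checking in each model's output.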

Experiment 2: Long-Context Faithfulness

Goal: Test whether a model can accurately answer questions about content deep in a long context.

Setup:

  1. Take any moderately large Python project you have (or use a popular open-source project)
  2. Concatenate 20-40 Python files into a single string (preserve filenames as headers)
  3. Insert a "needle" — a deliberately unusual comment in one of the middle files:
    python
    # AUDIT_MARKER: This function was reviewed by Sarah Chen on 2024-03-15
  4. Ask the model: "Find every function in this codebase that has been marked with an AUDIT_MARKER comment. List the function name, file, and reviewer."

What to look for:

  • Does the model find the marker even when it appears 50,000 tokens into the context?
  • Does it confidently say "no markers found" when you gave it zero markers? (hallucination test)
  • Does it extract the exact information (name, file, reviewer) or paraphrase?

Expected winners: Claude and Gemini — both have documented high scores on Needle in a Haystack benchmarks at long context lengths. GPT-5 is strong at 128K. DeepSeek R1 may struggle past 64K.
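
Building the haystack can itself be scripted, which makes the needle depth repeatable across models. A minimal sketch using synthetic files in place of a real project (the marker text follows the lesson's example; file contents are illustrative):

```python
def build_haystack(files: list[tuple[str, str]], needle: str,
                   needle_file_index: int) -> str:
    """Concatenate files with headers, injecting the needle into one file."""
    parts = []
    for i, (name, content) in enumerate(files):
        body = f"{needle}\n{content}" if i == needle_file_index else content
        parts.append(f"# ==== {name} ====\n{body}")
    return "\n\n".join(parts)


# Synthetic stand-ins for real project files
files = [(f"module_{i}.py", f"def f_{i}():\n    return {i}") for i in range(40)]
needle = "# AUDIT_MARKER: This function was reviewed by Sarah Chen on 2024-03-15"
haystack = build_haystack(files, needle, needle_file_index=20)

depth = haystack.index(needle) / len(haystack)
print(f"needle at ~{depth:.0%} depth")  # roughly the middle of the context
```

Varying `needle_file_index` lets you plot retrieval accuracy against depth — the same sweep the published Needle in a Haystack benchmarks perform.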

Experiment 3: Instruction Following — JSON Schema Adherence

Goal: Test whether a model returns EXACTLY what you asked for and nothing else.

The prompt:

code
Analyze this Python function and return ONLY a JSON object. Do not include
any explanation, markdown code blocks, or additional text. The JSON must
have exactly these fields and no others:
{
  "has_type_hints": boolean,
  "cyclomatic_complexity": integer (1-10 scale estimate),
  "has_docstring": boolean,
  "potential_bugs": array of strings
}

Function to analyze:
def calculate_discount(price, discount_pct, min_price=0):
    discounted = price - (price * discount_pct / 100)
    if discounted < min_price:
        return min_price
    return discounted

What to look for (ordered by failure severity):

  1. Critical fail: Returns non-JSON text ("Sure! Here is the analysis...")
  2. Major fail: Wraps the JSON in markdown code fences (```json ... ```)
  3. Minor fail: Adds extra fields not in the spec ("suggestions", "overall_score")
  4. Pass: Returns only the bare JSON object with exactly those four fields

Expected behavior by model:

  • Claude: Very high probability of returning bare JSON with exactly the specified fields
  • GPT-5: High probability, but may add markdown fences
  • o4-mini: High probability, but may add a brief preamble
  • Gemini: Usually good, occasionally adds "note" or "analysis" fields
  • DeepSeek R1: Good, but small chance of including Chinese analysis text

Teaching moment: This experiment reveals why system prompts in production apps say things like "You MUST respond with ONLY valid JSON. No markdown. No explanation. No preamble. If you cannot answer, return {\"error\": \"reason\"}." — because even top models need explicit reinforcement.
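
In production you rarely trust the model to pass this test; you parse defensively instead. A sketch of a tolerant parser that strips fences and preambles, then enforces the exact field set (field names follow the experiment's schema; the error handling is one reasonable choice, not a standard):

```python
import json
import re

EXPECTED_FIELDS = {"has_type_hints", "cyclomatic_complexity",
                   "has_docstring", "potential_bugs"}


def parse_model_json(raw: str) -> dict:
    """Extract the first {...} span, parse it, and enforce the schema."""
    # Tolerates markdown fences and chatty preambles around the object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(match.group(0))
    if set(obj) != EXPECTED_FIELDS:
        raise ValueError(f"field mismatch: {sorted(set(obj) ^ EXPECTED_FIELDS)}")
    return obj


# A deliberately messy response: preamble plus a fenced code block
messy = ('Sure! Here is the analysis:\n'
         '```json\n'
         '{"has_type_hints": false, "cyclomatic_complexity": 2,\n'
         ' "has_docstring": false, "potential_bugs": ["no input validation"]}\n'
         '```')
print(parse_model_json(messy)["cyclomatic_complexity"])  # 2
```

Many providers now also offer structured-output or JSON modes that constrain generation directly; a defensive parser like this remains a useful last line of defense either way.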

Experiment 4: Reasoning Chains — Complex Debugging

Goal: Diagnose why a subtle bug exists and propose the correct fix.

The buggy code:

python
import threading
from collections import defaultdict
 
class RequestCounter:
    def __init__(self):
        self.counts = defaultdict(int)
 
    def increment(self, endpoint: str):
        self.counts[endpoint] += 1
 
    def get_total(self) -> int:
        return sum(self.counts.values())
 
counter = RequestCounter()
 
def handle_request(endpoint: str):
    counter.increment(endpoint)
 
# Simulate 1000 concurrent requests
threads = [threading.Thread(target=handle_request, args=("/api/data",))
           for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
 
print(counter.get_total())  # Expect 1000, sometimes get less

The prompt:

code
This code sometimes prints a number less than 1000.
Identify the root cause and provide the corrected code.

The correct answer: The += operation on defaultdict is not atomic. It involves a read, increment, and write — three separate operations. With multiple threads, a race condition occurs: two threads can read the same value, both increment it, and write back the same incremented value, losing one count. Fix: use threading.Lock() around the increment operation, or replace with threading.local() plus aggregation, or use collections.Counter with a lock.
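The lock-based fix described above looks like this (a minimal sketch — only the locking is new relative to the original code):

```python
# Thread-safe version: a Lock makes the read-modify-write atomic.
import threading
from collections import defaultdict

class RequestCounter:
    def __init__(self):
        self.counts = defaultdict(int)
        self._lock = threading.Lock()   # guards every access to self.counts

    def increment(self, endpoint: str):
        with self._lock:                # += is read + add + write; serialize it
            self.counts[endpoint] += 1

    def get_total(self) -> int:
        with self._lock:
            return sum(self.counts.values())

counter = RequestCounter()
threads = [threading.Thread(target=counter.increment, args=("/api/data",))
           for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.get_total())  # Always 1000 now
```

A good model answer should arrive at something equivalent and explain why the lock closes the race window.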

What to look for:

  • Does the model identify "race condition" and "non-atomic operation"?
  • Does it correctly explain the read-modify-write sequence?
  • Does it propose threading.Lock() as the solution?
  • Does it explain WHY the fix works, not just provide code?

Expected winner: o4-mini and Claude Opus 4 tend to produce the clearest reasoning chains here. GPT-5 usually gets it right but may be less precise about the atomicity explanation. Gemini is also strong. DeepSeek R1 is very good at this class of problem.

Comprehensive Model Comparison Table

| Dimension | Claude Opus 4 | GPT-5 | o4-mini | Gemini 2.5 Pro | DeepSeek R1 | Qwen3 32B |
|---|---|---|---|---|---|---|
| Context window | 200K | 128K | 128K | 1M | 128K | 128K |
| Instruction following | Excellent | Very Good | Good | Good | Good | Good |
| Code generation | Excellent | Excellent | Very Good | Very Good | Very Good | Good |
| Reasoning / algorithms | Excellent | Very Good | Excellent | Very Good | Excellent | Very Good |
| Long-context faithfulness | Excellent | Very Good | Very Good | Excellent | Good | Good |
| JSON/format adherence | Excellent | Very Good | Good | Good | Good | Good |
| Multimodal (vision) | No* | Yes (GPT-4o) | No | Yes | No | Yes |
| Self-hosted | No | No | No | No | Yes | Yes |
| Approx. input price | $15/M | $2.50/M | $1.10/M | $1.25/M | $0.55/M | Free (local) |
| Best for | Large refactors, precise formatting | Rapid prototyping, general work | Hard reasoning, low cost | Huge context, multi-language | Air-gapped, reasoning | Privacy, local dev |

*Claude supports vision via separate API features as of 2025.
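The price row becomes concrete with a back-of-envelope calculation. This sketch uses the table's input prices (per 1M tokens; they change often, so recheck the pricing pages) to cost a single 10K-token prompt per provider:

```python
# Cost of one 10K-token prompt at the table's approximate input prices.
# Prices are per 1M input tokens and change frequently — verify before relying on them.
PRICE_PER_M = {
    "Claude Opus 4": 15.00,
    "GPT-5": 2.50,
    "o4-mini": 1.10,
    "Gemini 2.5 Pro": 1.25,
    "DeepSeek R1": 0.55,
}

def input_cost(model: str, tokens: int) -> float:
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return PRICE_PER_M[model] * tokens / 1_000_000

for model in PRICE_PER_M:
    print(f"{model:16s} ${input_cost(model, 10_000):.4f}")
# e.g. Claude Opus 4 costs $0.1500 per 10K-token prompt; DeepSeek R1 costs $0.0055
```

A 27x spread per prompt compounds quickly at production volume, which is why tiered model routing (flagship for hard tasks, cheap model for easy ones) is standard practice.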

Key Takeaways

  • No model is universally best. The right model depends on your task, context requirements, budget, and privacy constraints.
  • Claude leads on instruction following and multi-file architectural work. Use it when correctness and precision matter.
  • GPT-5 leads on developer experience and broad framework knowledge. Use it for fast prototyping and general coding.
  • o4-mini is the best value for reasoning-intensive tasks — nearly o3 quality at 10% the cost.
  • Gemini 2.5 Pro is unmatched for massive-context tasks. If your prompt won't fit anywhere else, send it to Gemini.
  • Open source models (DeepSeek R1, Qwen3, Llama 4) have reached production quality and are essential for privacy-sensitive and cost-sensitive workloads.
  • Model "personalities" are real — Claude is precise, GPT-5 is pragmatic, Gemini is thorough, DeepSeek R1 is terse. Use this to your advantage.

Common Mistakes to Avoid

  • Assuming benchmark scores = real-world performance. Benchmark scores are a useful signal, but models are often tuned toward the benchmarks themselves. Always run your own representative tasks.
  • Using the flagship model for everything. Opus 4 costs 10x more than Sonnet 4. For simple tasks, the cheaper model is often equivalent.
  • Sending the same prompt to every model. Different models respond to different prompt styles. What works for Claude may underperform on GPT-5. Tune your prompts per model.
  • Ignoring context window costs. Loading 150K tokens into every request adds up fast. Be strategic about what you include in context.
  • Treating open-source as "worse." For many developer tasks, Qwen3 32B or DeepSeek R1 14B running locally produces output indistinguishable from paid APIs.
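The context-window mistake above is easy to quantify. This sketch shows what resending a 150K-token context on every request costs over a month, assuming $2.50 per 1M input tokens (the table's GPT-5 input price — substitute your own rate):

```python
# Cumulative cost of resending a large context on every request.
# Assumes $2.50 per 1M input tokens; adjust PRICE_PER_TOKEN for your provider.
PRICE_PER_TOKEN = 2.50 / 1_000_000

def monthly_cost(context_tokens: int, requests_per_day: int, days: int = 30) -> float:
    """Input-token cost of resending the same context on every request."""
    return context_tokens * PRICE_PER_TOKEN * requests_per_day * days

print(f"${monthly_cost(150_000, 100):,.2f}/month")  # 150K tokens, 100 req/day
print(f"${monthly_cost(5_000, 100):,.2f}/month")    # trimmed 5K-token context
```

At 100 requests a day, trimming the context from 150K to 5K tokens cuts the input bill from $1,125 to $37.50 a month, before output tokens are even counted.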

Homework / Self-Study

  1. Run Experiment 3 yourself. Pick any two models you have API access to. Run the exact JSON instruction-following prompt. Document whether they add markdown fences, extra fields, or preamble text. Bring your findings to the next class.

  2. Install Ollama. Follow the setup instructions in Part 5. Pull qwen3:8b (small enough for most laptops). Ask it to refactor a function from your own codebase. Note the quality difference vs. any cloud model you have access to.

  3. Read the pricing pages for Anthropic, OpenAI, and Google AI Developer. Write down the current price per 1M output tokens for each frontier model. Prices change frequently — this exercise builds the habit of checking.
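For homework item 2, once `ollama serve` is running locally, you can talk to it programmatically. This sketch builds a request body for Ollama's `/api/generate` endpoint (the endpoint, default port, and `model`/`prompt`/`stream` fields are Ollama's documented API; the refactoring prompt itself is just an example):

```python
# Build a request for a locally running Ollama server (default port 11434).
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_refactor_request(model: str, source_code: str) -> dict:
    """Assemble a /api/generate request body asking for a refactor."""
    prompt = ("Refactor the following Python function. Add type hints "
              "and a Google-style docstring.\n\n" + source_code)
    return {"model": model, "prompt": prompt, "stream": False}

body = build_refactor_request("qwen3:8b", "def add(a, b):\n    return a + b")
print(json.dumps(body)[:60])
# Send with e.g. requests.post(OLLAMA_URL, json=body) once the server is up.
```

Because the model runs on your machine, you can iterate on prompts without per-token cost — useful for comparing local output quality against a cloud model on the same function.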

Checkpoint: Model Selection Rationale

Your task: Write a 1-paragraph model selection rationale for your project.

You are building [describe your project or choose one of these: a personal finance tracker web app / an internal code review bot / a local document Q&A tool for a law firm / a data pipeline for processing medical records].

Your paragraph must answer:

  1. Which model (or combination of models) will you use?
  2. Why do this model's strengths align with your specific use case?
  3. What are the top 2 constraints driving your choice (context window, cost, privacy, reasoning depth, instruction following, multimodality)?
  4. Which model did you explicitly choose NOT to use, and why?

Example rationale (for a code review bot):

"For our internal code review bot, I would use Claude Sonnet 4 as the primary model with Claude Haiku 3.5 as a fallback for simple formatting checks. Sonnet 4's combination of precise instruction following and 200K token context allows it to load an entire pull request — including the diff, referenced files, and our company style guide — in a single prompt, which is critical for accurate cross-file analysis. The two primary constraints are instruction following precision (we need structured JSON output for every review comment) and cost at scale (we process 50+ PRs daily). I explicitly chose not to use Gemini 2.5 Pro despite its larger context window because our data cannot leave our internal network, and we cannot self-host Gemini; and I ruled out o4-mini because its slightly lower instruction adherence increases the risk of malformed JSON crashing our CI pipeline."

This rationale format — model choice, strength alignment, constraints, explicit rejections — is what professional engineers write in architecture decision records (ADRs) when adopting AI tooling.

Next Lesson Preview

In Lesson 4: Prompt Engineering for Developers, we will:

  • Write system prompts that reliably produce structured output
  • Learn the difference between zero-shot, few-shot, and chain-of-thought prompting
  • Build a prompt template library for the most common developer tasks
  • Understand why temperature, top-p, and max-tokens matter for code generation
