Lesson 9: Basic Statistics with pandas

Course: Data Engineering | Duration: 45 minutes | Level: Intermediate

Learning Objectives

By the end of this lesson, you will be able to:

Generate summary statistics for a DataFrame with .describe()
Compute column-level statistics: .mean(), .median(), .std(), .min(), .max(), .sum()
Count value frequencies with .value_counts() including normalized proportions
Compute pairwise correlations between numeric columns using .corr()
Count distinct values with .nunique() and list them with .unique()

Prerequisites

Lesson 8: Sorting and Ranking

Lesson Outline

Part 1: `.describe()` — Summary Statistics

.describe() produces a statistical summary for all numeric columns in one call. It is the first thing to run when you receive a new dataset.

python

import pandas as pd
 
df = pd.DataFrame({
    'product':  ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam',
                 'Laptop', 'Keyboard', 'Mouse', 'Monitor', 'USB Hub'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics',
                 'Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'quantity': [1, 2, 1, 1, 1, 2, 3, 1, 2, 4],
    'unit_price': [1200.0, 25.0, 75.0, 350.0, 89.0,
                   1200.0, 75.0, 25.0, 350.0, 15.0]
})
 
print(df.describe())
#        quantity   unit_price
# count  10.000000    10.00000
# mean    1.800000   240.40000
# std     0.918937   416.14873
# min     1.000000    15.00000
# 25%     1.000000    25.00000
# 50%     1.500000    82.00000
# 75%     2.000000   412.50000
# max     4.000000  1200.00000

Reading .describe() output:

count: non-null row count (if lower than expected, missing values exist)
mean: arithmetic mean
std: standard deviation — how spread out the values are
min / max: range of values
25% / 50% / 75%: quartiles — 50% is the median

Part 2: Column-Level Statistics

For individual statistics on a single column:

python

import pandas as pd
 
df = pd.DataFrame({
    'quantity':   [1, 2, 1, 1, 1, 2, 3, 1, 2, 4],
    'unit_price': [1200.0, 25.0, 75.0, 350.0, 89.0,
                   1200.0, 75.0, 25.0, 350.0, 15.0]
})
 
# Mean
print("Mean price:", df['unit_price'].mean())    # 240.4
 
# Median (middle value when sorted — robust to outliers)
print("Median price:", df['unit_price'].median())  # 82.0
 
# Standard deviation
print("Std price:", df['unit_price'].std())       # 416.15
 
# Variance
print("Var price:", df['unit_price'].var())
 
# Min and max
print("Min price:", df['unit_price'].min())       # 15.0
print("Max price:", df['unit_price'].max())       # 1200.0
 
# Sum
print("Total quantity ordered:", df['quantity'].sum())  # 18
 
# Count non-null values
print("Count:", df['unit_price'].count())         # 10

Part 3: `.value_counts()` — Frequency Counts

.value_counts() counts how often each unique value appears in a column. Essential for exploring categorical data.

python

import pandas as pd
 
df = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics',
                 'Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'unit_price': [1200.0, 25.0, 75.0, 350.0, 89.0,
                   1200.0, 75.0, 25.0, 350.0, 15.0]
})
 
# Count occurrences of each category
print(df['category'].value_counts())
# Accessories    5
# Electronics    5
# Name: category, dtype: int64
 
# Proportions — normalize=True converts counts to fractions
print(df['category'].value_counts(normalize=True))
# Accessories    0.5
# Electronics    0.5
# Name: proportion, dtype: float64
 
# Sort by category name instead of count
print(df['category'].value_counts().sort_index())

Part 4: `.corr()` — Pairwise Correlations

.corr() computes the Pearson correlation coefficient between every pair of numeric columns. Values range from -1 (perfect negative) to +1 (perfect positive); 0 means no linear relationship.

python

import pandas as pd
 
df = pd.DataFrame({
    'quantity':   [1, 2, 1, 1, 1, 2, 3, 1, 2, 4],
    'unit_price': [1200.0, 25.0, 75.0, 350.0, 89.0,
                   1200.0, 75.0, 25.0, 350.0, 15.0],
    'total_price': [1200.0, 50.0, 75.0, 350.0, 89.0,
                    2400.0, 225.0, 25.0, 700.0, 60.0]
})
 
print(df.corr())
#             quantity  unit_price  total_price
# quantity    1.000000   -0.547...    0.673...
# unit_price -0.547...    1.000000    0.970...
# total_price 0.673...    0.970...    1.000000

Interpret: unit_price and total_price have a correlation of ~0.97 — very high, expected since total = quantity × unit_price.

Part 5: `.nunique()` and `.unique()` — Distinct Values

python

import pandas as pd
 
df = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics'],
    'product':  ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor']
})
 
# Count distinct values per column
print(df.nunique())
# category    2
# product     4
# dtype: int64
 
# Count distinct values in one column
print(df['category'].nunique())  # 2
 
# List distinct values (unordered)
print(df['category'].unique())   # ['Electronics' 'Accessories']
print(df['product'].unique())    # ['Laptop' 'Mouse' 'Keyboard' 'Monitor']

Practice

Key Takeaways

.describe() — one-call summary: count, mean, std, min, quartiles, max for all numeric columns
Column statistics: .mean(), .median(), .std(), .var(), .min(), .max(), .sum(), .count()
.value_counts() — frequency table for categorical columns; normalize=True gives proportions
.corr() — pairwise Pearson correlation matrix; values -1 to +1; 0 = no linear relationship
.nunique() — count of distinct values; .unique() — array of distinct values

Common Mistakes to Avoid

Running .describe() and getting only 2 columns: non-numeric columns are excluded by default; pass include='all' for all dtypes
Using .mean() on skewed data without checking: always compare mean vs median for numeric columns; large gap = skewed data
Interpreting correlation as causation: high .corr() value means linear relationship, not causation

← Previous Lesson | Back to Course Overview | Next Lesson: Inspecting DataFrames: Shape and Info →

Lesson 9: Basic Statistics with pandas

Learning Objectives

Prerequisites

Lesson Outline

Part 1: `.describe()` — Summary Statistics

Part 2: Column-Level Statistics

Part 3: `.value_counts()` — Frequency Counts

Part 4: `.corr()` — Pairwise Correlations

Part 5: `.nunique()` and `.unique()` — Distinct Values

Practice

Key Takeaways

Common Mistakes to Avoid

Concept Map

Try it yourself

Lesson 9: Basic Statistics with pandas

Learning Objectives

Prerequisites

Lesson Outline

Part 1: .describe() — Summary Statistics

Part 2: Column-Level Statistics

Part 3: .value_counts() — Frequency Counts

Part 4: .corr() — Pairwise Correlations

Part 5: .nunique() and .unique() — Distinct Values

Practice

Key Takeaways

Common Mistakes to Avoid

Concept Map

Try it yourself

Part 1: `.describe()` — Summary Statistics

Part 3: `.value_counts()` — Frequency Counts

Part 4: `.corr()` — Pairwise Correlations

Part 5: `.nunique()` and `.unique()` — Distinct Values