Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, and interpreting data. In a data-driven world, statistical literacy is an essential skill, whether for understanding medical research or for making informed everyday decisions.
Descriptive statistics summarizes data using numbers and visualizations, while inferential statistics uses sample data to draw conclusions about larger populations. This article focuses on descriptive statistics and closes with a short introduction to probability.
Types of Data
Categorical vs. Numerical Data
Categorical (Qualitative) data represents characteristics or qualities:
- Nominal: Categories with no natural order (colors, gender, eye color)
- Ordinal: Categories with a meaningful order (rating scales, education level)
Numerical (Quantitative) data represents quantities that can be counted or measured:
- Discrete: Countable values (number of students, dice rolls)
- Continuous: Any value in a range (height, temperature, time)
Measures of Central Tendency
Central tendency describes the center or typical value of a dataset. The three main measures are mean, median, and mode.
Mean (Average)
The mean is the sum of all values divided by the count of values:
Mean (x̄) = Σx / n
where Σx is the sum of all values and n is the number of values.
Example 1: Calculating Mean
Find the mean of: 4, 8, 6, 5, 3, 8, 9
Sum = 4 + 8 + 6 + 5 + 3 + 8 + 9 = 43
n = 7 values
Mean = 43 / 7 ≈ 6.14
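The calculation in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `mean` is chosen here for illustration, and the standard library's `statistics.mean` would serve equally well.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


data = [4, 8, 6, 5, 3, 8, 9]
print(round(mean(data), 2))  # 6.14
```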
Weighted Mean: When values differ in importance, each value is multiplied by its weight before averaging:
Weighted mean = Σ(w · x) / Σw
where w is the weight for each value x.
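As an illustration of the weighted mean, the sketch below averages three scores with different weights. The scores and weights are made-up example data, and the function name `weighted_mean` is an assumption for this sketch.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical example: homework, midterm, and final exam scores,
# weighted 2 : 3 : 5.
scores = [80, 90, 70]
weights = [2, 3, 5]
print(weighted_mean(scores, weights))  # 78.0
```

The unweighted mean of the same scores would be 80; the weighted mean is lower because the lowest score carries the largest weight.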
Median
The median is the middle value when the data is arranged in order. With an even number of values, it is the average of the two middle values.
Example 2: Finding the Median
Data: 3, 7, 2, 9, 5, 8, 1
Step 1: Arrange in order: 1, 2, 3, 5, 7, 8, 9
Step 2: Find the middle: 7 values, so the 4th is middle
Median = 5
Example 3: Median with Even Number of Values
Data: 3, 7, 2, 9, 5, 8
Step 1: Arrange: 2, 3, 5, 7, 8, 9
Step 2: Average of two middle values: (5 + 7) / 2
Median = 6
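Both cases (odd and even counts) can be handled by one small function. The following is a sketch with an illustrative function name; `statistics.median` in the standard library behaves the same way.

```python
def median(values):
    """Return the middle value of the sorted data, or the average of the
    two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 7, 2, 9, 5, 8, 1]))  # 5
print(median([3, 7, 2, 9, 5, 8]))     # 6.0
```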
Mode
The mode is the most frequently occurring value(s) in a dataset.
Example 4: Finding the Mode
Data: 3, 5, 7, 5, 2, 5, 9, 5
Count: 3(1), 5(4), 7(1), 2(1), 9(1)
Mode = 5 (appears 4 times)
A dataset can have:
- No mode: All values appear equally often
- One mode (unimodal): One value appears most often
- Two modes (bimodal): Two values tie for most frequent
- Many modes (multimodal): More than two values tie for most frequent
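The cases listed above can be sketched with `collections.Counter`. The function name `modes` is illustrative, and returning an empty list when every value appears equally often is a convention adopted here to match the "no mode" case; other sources treat that situation differently.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the most frequent value(s).

    Returns an empty list when several distinct values all appear
    equally often (treated here as "no mode").
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([3, 5, 7, 5, 2, 5, 9, 5]))  # [5]
print(modes([1, 1, 2, 2, 3]))           # [1, 2]
print(modes([1, 2, 3]))                 # []
```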
Choosing the Right Measure
- Mean: Best for symmetric distributions without outliers
- Median: Better when data has outliers or is skewed
- Mode: Useful for categorical data or identifying most common value
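To see why the median is preferred when outliers are present, consider a small made-up dataset. Adding one extreme value moves the mean from 35 to about 95.8, while the median only moves from 35 to 36.5. The numbers below are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the second list adds one extreme value.
typical = [30, 32, 35, 38, 40]
with_outlier = typical + [400]

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    print(label, "mean:", round(mean(data), 1), "median:", median(data))
```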
Measures of Spread (Variation)
Central tendency alone doesn't fully describe data. Spread tells us how varied the data is.
Range
The simplest measure of spread is the difference between the largest and smallest values:
Range = Maximum - Minimum
Data: 3, 7, 2, 9, 5, 8
Range = 9 - 2 = 7
Variance and Standard Deviation
These measure how far values typically are from the mean.
Population Variance (σ²)
σ² = Σ(x - μ)² / N
where μ is the population mean and N is the population size.
Sample Variance (s²)
s² = Σ(x - x̄)² / (n - 1)
where x̄ is the sample mean and n is the sample size.
Note: Dividing by (n - 1) instead of n gives an unbiased estimate of the population variance.
Standard Deviation
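Standard deviation is the square root of the variance:
σ = √σ² (population), s = √s² (sample)
Because of the square root, standard deviation is in the same units as the original data.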
Example 5: Calculating Variance and Standard Deviation
Data: 4, 8, 6, 5, 3 (n = 5)
Step 1: Mean = (4 + 8 + 6 + 5 + 3) / 5 = 26/5 = 5.2
Step 2: Find deviations from mean:
(4-5.2) = -1.2, (8-5.2) = 2.8, (6-5.2) = 0.8, (5-5.2) = -0.2, (3-5.2) = -2.2
Step 3: Square deviations:
1.44, 7.84, 0.64, 0.04, 4.84
Step 4: Sum of squared deviations = 14.8
Step 5: Sample variance = 14.8 / (n - 1) = 14.8 / 4 = 3.7
Step 6: Standard deviation = √3.7 ≈ 1.92
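The six steps of Example 5 translate directly into code. This is a minimal sketch with illustrative function names; the standard library offers `statistics.variance` and `statistics.pvariance` for the sample and population versions.

```python
import math


def sum_squared_deviations(values):
    """Steps 1-4: mean, deviations, squared deviations, and their sum."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    """Divide by n - 1 (data is a sample from a larger population)."""
    if len(values) < 2:
        raise ValueError("sample variance requires at least two values")
    return sum_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    """Divide by N (data is the entire population)."""
    if not values:
        raise ValueError("population variance requires at least one value")
    return sum_squared_deviations(values) / len(values)


data = [4, 8, 6, 5, 3]
s2 = sample_variance(data)
print(round(s2, 2), round(math.sqrt(s2), 2))  # 3.7 1.92
print(round(population_variance(data), 2))    # 2.96
```

Treating the same five values as a full population gives a smaller variance (2.96), because the sum of squared deviations is divided by 5 rather than 4.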
The Empirical Rule (68-95-99.7 Rule)
For normally distributed data:
- ~68% of data falls within 1 standard deviation of the mean
- ~95% of data falls within 2 standard deviations
- ~99.7% of data falls within 3 standard deviations
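These percentages can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` (Python 3.8 or later); the helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```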
Quartiles and the Interquartile Range
Quartiles divide data into four equal parts:
- Q1 (First quartile): 25th percentile
- Q2 (Second quartile): 50th percentile (same as median)
- Q3 (Third quartile): 75th percentile
Interquartile Range (IQR)
IQR = Q3 - Q1
The IQR is the range of the middle 50% of the data and is resistant to outliers.
Example 6: Finding Quartiles
Data: 2, 4, 5, 6, 7, 8, 9, 12
Q2 (median): (6 + 7) / 2 = 6.5
Q1: Median of lower half (2, 4, 5, 6) = (4 + 5) / 2 = 4.5
Q3: Median of upper half (7, 8, 9, 12) = (8 + 9) / 2 = 8.5
IQR: 8.5 - 4.5 = 4
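Example 6 uses the "median of each half" method, sketched below. The function names are illustrative, and excluding the middle value from both halves when the count is odd is an assumed convention. Library routines such as `statistics.quantiles` use interpolation instead and can return slightly different quartiles for the same data.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("quartiles require at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half
    # (an assumed convention; some textbooks include it in both).
    upper = ordered[half + (n % 2):]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


q1, q2, q3 = quartiles([2, 4, 5, 6, 7, 8, 9, 12])
print(q1, q2, q3, q3 - q1)  # 4.5 6.5 8.5 4.0
```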
Box Plots
A box plot visually displays the five-number summary:
- Minimum
- Q1
- Median (Q2)
- Q3
- Maximum
Probability Basics
Probability measures how likely an event is to occur, on a scale from 0 (impossible) to 1 (certain).
Basic Probability Formula
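P(event) = Number of favorable outcomes / Total number of possible outcomes
This formula applies only when the listed outcomes cover every possibility and are all equally likely.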
Example 7: Basic Probability
What is the probability of rolling an even number on a fair die?
Favorable outcomes: 2, 4, 6 (3 outcomes)
Total outcomes: 6
P(even) = 3/6 = 1/2 or 0.5 or 50%
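The same count-and-divide reasoning can be sketched in Python using exact fractions. The function name `probability` is an illustrative choice, and it assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event) = favorable outcomes / total outcomes,
    assuming all outcomes in sample_space are equally likely."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))


die = [1, 2, 3, 4, 5, 6]
p_even = probability(lambda face: face % 2 == 0, die)
print(p_even, float(p_even))  # 1/2 0.5
```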
Probability Rules
Complement Rule: P(not A) = 1 - P(A)
Addition Rule (for mutually exclusive events): P(A or B) = P(A) + P(B)
Multiplication Rule (for independent events): P(A and B) = P(A) × P(B)
Example 8: Probability Rules
A bag has 3 red and 7 blue balls. What is P(not red)?
P(red) = 3/10
P(not red) = 1 - 3/10 = 7/10
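The three rules can be tried out with exact fractions. The first calculation repeats Example 8; the two die-rolling cases are additional illustrations that are not part of the original examples.

```python
from fractions import Fraction

# Complement rule (Example 8): P(not red) = 1 - P(red)
p_red = Fraction(3, 10)
p_not_red = 1 - p_red

# Addition rule for mutually exclusive events:
# one roll of a fair die cannot show both 1 and 2.
p_one_or_two = Fraction(1, 6) + Fraction(1, 6)

# Multiplication rule for independent events:
# two separate rolls of a fair die both showing 6.
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)

print(p_not_red, p_one_or_two, p_two_sixes)  # 7/10 1/3 1/36
```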
Expected Value
The expected value is the long-run average of a random variable:
E(X) = Σ [x · P(x)]
That is, the sum of each outcome multiplied by its probability.
Example 9: Expected Value
A game costs $5 to play. You win $20 with probability 0.2 and nothing otherwise. What is your expected gain?
E(gain) = ($20 - $5)(0.2) + (-$5)(0.8) = $3 - $4 = -$1
On average, you lose $1 per game.
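Example 9 can be expressed as a list of (gain, probability) pairs. This is a small sketch; the function name `expected_value` and the check that the probabilities sum to 1 are additions for illustration.

```python
def expected_value(outcomes):
    """Return the sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    if abs(total_probability - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# Net gain after the $5 entry fee: win $20 with probability 0.2, else nothing.
game = [(20 - 5, 0.2), (-5, 0.8)]
print(round(expected_value(game), 2))  # -1.0
```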
Key Takeaways
- Mean = sum/n; best for symmetric data
- Median = middle value; better for skewed data
- Mode = most frequent value; useful for categorical data
- Standard deviation measures spread from the mean
- IQR (Q3 - Q1) measures spread of middle 50%
- Probability ranges from 0 to 1
- Expected value is the long-run average