Sample Size for Comparing Two Independent Means

A guide to determining the sample size required to detect a meaningful difference between two group means.

Overview

This calculator determines the number of subjects needed in each group to detect a specified difference between two independent population means with a given level of statistical power. It is used when planning a two-arm study such as a randomised controlled trial, a cohort study, or any comparison of two independent groups.

The calculation balances four inputs: the significance level (α), the desired power (1 − β), the expected variability (σ), and the minimum clinically important difference (δ). A smaller α, a higher power, a larger σ, or a smaller δ each increases the required sample size.

The calculator supports three hypothesis frameworks: equality (standard two-sided test), non-inferiority/superiority (one-sided with a margin), and equivalence (two one-sided tests, TOST).

Worked Example

Scenario: Comparing Test Scores Between Two Teaching Methods

An education researcher plans to compare the mean exam score between students taught with a new interactive method versus the traditional lecture method. Based on pilot data, the common standard deviation is σ = 12 points. The researcher considers a difference of 5 points to be educationally meaningful and wants 80% power at a 5% significance level (two-sided) with equal allocation (1:1 ratio).

Using Statulator step-by-step:

1. Open the Sample Size Calculator for Comparing Two Independent Means.

2. Under Hypothesis, select Equality (the default).

3. Set Significance Level (α) to 0.05 and Power (1 − β) to 0.80.

4. Enter the Standard Deviation (σ) as 12.

5. Enter the Mean Difference as 5.

6. Keep the Allocation Ratio (r) at 1 (equal groups).

7. The calculator shows the required sample size per group. The result should be approximately n = 91 per group (182 total).

Hand calculation verification:
\[ n = \frac{(r + 1)}{r} \cdot \frac{(z_{\alpha/2} + z_{\beta})^{2} \cdot \sigma^{2}}{\delta^{2}} = \frac{2}{1} \cdot \frac{(1.96 + 0.842)^{2} \times 144}{25} = 2 \times \frac{7.849 \times 144}{25} \approx 90.4 \]

Rounding up: n = 91 per group, 182 total.
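
The hand calculation can be reproduced in a few lines of Python. This is a minimal sketch, not the calculator's own code: the function name `n_equality` and its parameter names are illustrative assumptions, and the z-values are taken from the standard library's `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def n_equality(delta, sigma, alpha=0.05, power=0.80, r=1.0):
    """Unrounded Group 1 sample size for the two-sided equality test (z-based).

    Hypothetical helper: delta is the mean difference to detect, sigma the
    common standard deviation, r the allocation ratio (Group 2 / Group 1).
    """
    z = NormalDist().inv_cdf
    z_alpha2 = z(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z(power)            # 0.842 for 80% power
    return (r + 1) / r * (z_alpha2 + z_beta) ** 2 * sigma ** 2 / delta ** 2


n = n_equality(delta=5, sigma=12)
print(round(n, 1), ceil(n))  # 90.4 91
```

Rounding is left to the caller so that the unrounded value stays visible for checking against the hand calculation.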

Non-Inferiority Example

If the researcher instead wanted to show that the new method is no more than 3 points worse than the traditional method (non-inferiority margin \( \delta_m = -3 \)), assuming a true mean difference \( \delta_0 = 0 \), the same σ, power, and allocation ratio, and a one-sided α = 0.05:

\[ n = \frac{(r + 1)}{r} \cdot \frac{(z_{\alpha} + z_{\beta})^{2} \cdot \sigma^{2}}{(\delta_{0} - \delta_{m})^{2}} = 2 \times \frac{(1.645 + 0.842)^{2} \times 144}{(0 - (-3))^{2}} = 2 \times \frac{6.18 \times 144}{9} \approx 197.8 \]

Rounding up: n = 198 per group.
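
The same check works for the one-sided case. The sketch below assumes a hypothetical helper `n_one_sided`; the only changes from the equality formula are the one-sided quantile \( z_{\alpha} \) and the distance between the assumed true difference and the margin in the denominator.

```python
from math import ceil
from statistics import NormalDist


def n_one_sided(delta0, margin, sigma, alpha=0.05, power=0.80, r=1.0):
    """Unrounded Group 1 sample size for a non-inferiority or superiority test.

    Hypothetical helper: delta0 is the assumed true difference and margin the
    non-inferiority or superiority margin (negative for non-inferiority here).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)  # one-sided: 1.645 for alpha = 0.05
    z_beta = z(power)
    return (r + 1) / r * (z_alpha + z_beta) ** 2 * sigma ** 2 / (delta0 - margin) ** 2


n = n_one_sided(delta0=0, margin=-3, sigma=12)
print(round(n, 1), ceil(n))  # 197.8 198
```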

Interpretation Guide

Output: Interpretation
Sample Size per Group: The minimum number of subjects needed in each group. With unequal allocation (r ≠ 1), the two groups will have different sizes: Group 1 = n, Group 2 = n × r.
Total Sample Size: The combined number across both groups: n × (1 + r).
Live Interpretation: A plain-language summary of what the sample size achieves, including the detectable difference, power, and significance level.
Visualisation: Shows how sample size changes across a range of effect sizes or standard deviations, helping you assess sensitivity to uncertain inputs.

Key considerations: For a fixed power, the total sample size is smallest at 1:1 allocation and grows as the ratio moves away from it. Under the formulas in this guide the total is proportional to (1 + r)²/r, so a 2:1 ratio needs about 12.5% more subjects in total than 1:1. Unequal allocation is sometimes needed for ethical or practical reasons (e.g., giving more participants the active treatment).
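
To see the cost of unequal allocation in numbers, the following sketch applies the z-based equality formula to the worked-example inputs (difference 5, SD 12) at r = 1 and r = 2. The helper name `group_sizes` is illustrative, and rounding Group 1 up before multiplying by r is an assumption about how rounding is handled, not a documented behaviour of the calculator.

```python
from math import ceil
from statistics import NormalDist


def group_sizes(delta, sigma, r, alpha=0.05, power=0.80):
    """Return (n1, n2, total) with n2 = n1 * r, using the z-based equality formula."""
    z = NormalDist().inv_cdf
    n = (r + 1) / r * (z(1 - alpha / 2) + z(power)) ** 2 * sigma ** 2 / delta ** 2
    n1 = ceil(n)       # Group 1, rounded up (assumed rounding rule)
    n2 = ceil(n1 * r)  # Group 2 = n x r
    return n1, n2, n1 + n2


for r in (1, 2):
    print(r, group_sizes(delta=5, sigma=12, r=r))
# 1 (91, 91, 182)
# 2 (68, 136, 204)
```

Moving from 1:1 to 2:1 raises the total from 182 to 204 for the same power and significance level.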

Formula

Equality (Two-Sided Test)
\[ n = \frac{(r + 1)}{r} \cdot \frac{(z_{\alpha/2} + z_{\beta})^{2} \cdot \sigma^{2}}{\delta^{2}} \]

Non-Inferiority / Superiority (One-Sided)
\[ n = \frac{(r + 1)}{r} \cdot \frac{(z_{\alpha} + z_{\beta})^{2} \cdot \sigma^{2}}{(\delta_{0} - \delta_{m})^{2}} \]

where \( \delta_m \) is the non-inferiority or superiority margin, and \( \delta_0 \) is the assumed true difference.

Equivalence (TOST)
\[ n = \frac{(r + 1)}{r} \cdot \frac{(z_{\alpha} + z_{\beta/2})^{2} \cdot \sigma^{2}}{(|\delta_{m}| - |\delta_{0}|)^{2}} \]

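For equivalence testing (two one-sided tests, TOST), the type II error is split between the two one-sided tests, so the power term uses \( z_{\beta/2} \) rather than \( z_{\beta} \). The formula requires \( |\delta_{0}| < |\delta_{m}| \): the assumed true difference must lie inside the equivalence margin.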
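
A short sketch of the equivalence formula follows. The helper `n_equivalence` and the inputs in the example (equivalence margin of 3 points, assumed true difference of 0, SD of 12) are illustrative assumptions chosen to echo the earlier examples, not values taken from the calculator.

```python
from math import ceil
from statistics import NormalDist


def n_equivalence(delta0, margin, sigma, alpha=0.05, power=0.80, r=1.0):
    """Unrounded Group 1 sample size for an equivalence (TOST) design, z-based."""
    if abs(delta0) >= abs(margin):
        raise ValueError("assumed true difference must lie inside the margin")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)             # each one-sided test is run at level alpha
    z_beta2 = z(1 - (1 - power) / 2)   # type II error split across the two tests
    gap = abs(margin) - abs(delta0)
    return (r + 1) / r * (z_alpha + z_beta2) ** 2 * sigma ** 2 / gap ** 2


# Hypothetical inputs: margin of 3 points, true difference 0, SD 12.
print(ceil(n_equivalence(delta0=0, margin=3, sigma=12)))  # 275
```

With the same margin and SD, the equivalence design needs more subjects than the non-inferiority design because \( z_{\beta/2} \) is larger than \( z_{\beta} \).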

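where:

\( n \): sample size in Group 1 (Group 2 has \( n \times r \) subjects)
\( r \): allocation ratio (Group 2 size divided by Group 1 size)
\( \alpha \): significance level
\( \beta \): type II error rate, so power is \( 1 - \beta \)
\( \sigma \): common standard deviation of the outcome
\( \delta \): difference between the two means to be detected
\( z_{p} \): upper \( p \) quantile of the standard normal distribution (for example \( z_{0.025} = 1.96 \) and \( z_{0.20} = 0.842 \))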

t-Distribution Adjustment

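Because the degrees of freedom \( \nu \) depend on the sample size, replace the \( z \)-values with \( t \)-values and iterate, starting from the \( z \)-based \( n \), until \( n_{t} \) stops changing: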

\[ n_{t} = \frac{(r + 1)}{r} \cdot \frac{(t_{\alpha/2,\,\nu} + t_{\beta,\,\nu})^{2} \cdot \sigma^{2}}{\delta^{2}} \quad \text{where} \quad \nu = (r + 1) \cdot n_{t} - 2 \]
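
The iteration can be sketched as below. Two assumptions are worth flagging: the helper names are illustrative, and because the Python standard library has no t-distribution, `t_quantile` uses a Cornish-Fisher expansion around the normal quantile as a stand-in. That approximation is adequate for the large degrees of freedom seen in these examples; a statistics library's exact t quantile would be the natural replacement.

```python
from math import ceil
from statistics import NormalDist


def t_quantile(p, nu):
    """Approximate t quantile (Cornish-Fisher expansion; assumes nu is not small)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z ** 3 + z) / 4
    g2 = (5 * z ** 5 + 16 * z ** 3 + 3 * z) / 96
    return z + g1 / nu + g2 / nu ** 2


def n_equality_t(delta, sigma, alpha=0.05, power=0.80, r=1.0, max_iter=50):
    """Rounded Group 1 sample size for the equality test with t-values, by iteration."""
    z = NormalDist().inv_cdf
    scale = (r + 1) / r * sigma ** 2 / delta ** 2
    n = ceil(scale * (z(1 - alpha / 2) + z(power)) ** 2)  # z-based starting value
    for _ in range(max_iter):
        nu = (r + 1) * n - 2  # degrees of freedom for the current n
        t_sum = t_quantile(1 - alpha / 2, nu) + t_quantile(power, nu)
        n_new = ceil(scale * t_sum ** 2)
        if n_new == n:  # converged: n no longer changes
            break
        n = n_new
    return n


print(n_equality_t(delta=5, sigma=12))  # 92
print(n_equality_t(delta=8, sigma=15))  # 57
```

For the worked-example inputs this sketch settles on 92 per group, one more than the z-based 91, which is the typical size of the adjustment at these sample sizes.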
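Cluster Sampling Adjustment
\[ n_{\text{cluster}} = n \times \text{DEFF} \quad \text{where} \quad \text{DEFF} = 1 + (m - 1) \cdot \rho \]

Here \( m \) is the average cluster size and \( \rho \) is the intracluster correlation coefficient.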
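
As a small illustration of the design-effect inflation, the sketch below reads m as the average cluster size and ρ as the intracluster correlation, the usual interpretation of Kish's design effect. The values m = 20 and ρ = 0.05 are hypothetical, and the function names are illustrative.

```python
from math import ceil


def design_effect(m, rho):
    """DEFF = 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho


def n_cluster(n, m, rho):
    """Inflate an individually randomised sample size n by the design effect."""
    return n * design_effect(m, rho)


# Hypothetical inputs: n = 91 per group, clusters of 20, rho = 0.05 (DEFF about 1.95).
print(ceil(n_cluster(n=91, m=20, rho=0.05)))  # 178
```

Even a small intracluster correlation nearly doubles the requirement here, because the inflation grows with cluster size.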

Assumptions & Requirements

Textbook Examples

Medicine

A clinical trial tests whether a new drug reduces systolic blood pressure more than a placebo. The clinically meaningful difference is 8 mmHg.

Inputs: Difference = 8 mmHg, SD = 15 mmHg (both groups), α = 0.05 (two-sided), power = 80%, allocation 1:1.
Result: n = 57 per group (114 total).
Interpretation: Enrolling 57 patients per arm provides 80% power to detect an 8 mmHg difference between treatments.

Education

Researchers compare mean exam scores between students using adaptive-learning software and traditional instruction.

Inputs: Difference = 5 points, SD = 12 points, α = 0.05 (two-sided), power = 90%.
Result: n = 122 per group (244 total).
Interpretation: Each group needs 122 students to detect a 5-point improvement with 90% power.

Engineering

An automotive lab compares fuel efficiency (km/L) between two engine designs.

Inputs: Difference = 1.5 km/L, SD = 2.8 km/L, α = 0.05 (two-sided), power = 80%.
Result: n = 56 per group (112 total).
Interpretation: Testing 56 vehicles with each engine design gives 80% power to detect a 1.5 km/L difference.

Agriculture

A field trial compares mean grain yield (kg/ha) between a new fertilizer and the current standard.

Inputs: Difference = 200 kg/ha, SD = 400 kg/ha, α = 0.05 (two-sided), power = 80%.
Result: n = 64 per group (128 total).
Interpretation: Allocating 64 plots to each treatment provides 80% power to detect a 200 kg/ha yield improvement.
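
The four results can be cross-checked against the z-based equality formula with 1:1 allocation. The sketch below is an independent check rather than the calculator's code, and the helper name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_z(delta, sigma, alpha=0.05, power=0.80):
    """z-based per-group n for a two-sided test with 1:1 allocation (unrounded)."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sigma ** 2 / delta ** 2


# (difference, SD, power) for each example above
examples = {
    "Medicine": (8, 15, 0.80),
    "Education": (5, 12, 0.90),
    "Engineering": (1.5, 2.8, 0.80),
    "Agriculture": (200, 400, 0.80),
}

z_based = {name: ceil(n_per_group_z(d, s, power=p)) for name, (d, s, p) in examples.items()}
print(z_based)
# {'Medicine': 56, 'Education': 122, 'Engineering': 55, 'Agriculture': 63}
```

The Education result matches the z-based value. The Medicine, Engineering, and Agriculture results listed above are each one subject per group larger, which is the size of increase the t-distribution adjustment typically produces; whether those figures were computed with that adjustment is an assumption.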

References

  1. Chow, S.-C., Shao, J., Wang, H., & Lokhnygina, Y. (2018). Sample Size Calculations in Clinical Research (3rd ed.). Chapman & Hall/CRC. Chapters 3–4: two-group designs for means.
  2. Julious, S. A. (2010). Sample Sizes for Clinical Trials. Chapman & Hall/CRC. Chapter 5: equality, non-inferiority, and equivalence designs.
  3. Machin, D., Campbell, M. J., Tan, S. B., & Tan, S. H. (2009). Sample Size Tables for Clinical Studies (3rd ed.). Wiley-Blackwell.
  4. Wang, H., & Chow, S.-C. (2007). Sample size calculation for comparing two means. Encyclopedia of Clinical Trials. Wiley.
  5. Kish, L. (1965). Survey Sampling. John Wiley & Sons. Design effect for cluster sampling.