Univariate Analysis

Summary statistics, frequency tables, and visualisations for individual variables in your dataset.

Overview

Univariate analysis examines each variable in your dataset independently. For numeric variables Statulator produces a summary statistics table (mean, standard deviation, quartiles, minimum, maximum) together with a histogram, Q-Q plot, and box plot. For categorical variables it produces a frequency table (counts and percentages) together with a bar chart and pie chart.

When the data are loaded, Statulator automatically classifies each column as Numeric, Categorical, or ID using simple heuristics: columns with fewer than 6 unique values are treated as categorical; columns where every value is unique are flagged as identifiers (IDs). You can override these defaults in the variable-selection dialog before running the analysis.
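The two rules above can be sketched in a few lines of Python. This is an illustration only, not Statulator's actual code: the function name `classify_column`, the treatment of `None` and empty strings as missing, the order in which the rules are checked, and the fallback for text columns with repeated values are all assumptions.

```python
def classify_column(values):
    """Guess a column type from its non-missing values.

    Assumptions (not confirmed by Statulator): missing entries are None or "",
    the categorical rule is checked before the ID rule, and a column counts as
    Numeric only when every remaining value parses as a number.
    """
    present = [v for v in values if v is not None and str(v).strip() != ""]
    n_unique = len(set(present))

    if n_unique < 6:                  # fewer than 6 unique values
        return "Categorical"
    if n_unique == len(present):      # every value is unique
        return "ID"
    try:
        for v in present:
            float(v)                  # raises if the value is not a number
        return "Numeric"
    except (TypeError, ValueError):
        return "Categorical"          # assumed fallback for repeated text values


print(classify_column(["Male", "Female", "Female", "Male"]))   # Categorical
print(classify_column([f"P{i:03d}" for i in range(250)]))      # ID
print(classify_column([34, 51, 51, 67, 42, 58, 73, 29, 34]))   # Numeric
```

Under these rules a continuous measurement with no repeated values would be flagged as an ID, which is one reason to check the detected types in the variable-selection dialog.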

Worked Example

Scenario: Hospital Patient Demographics

A hospital quality team has exported a CSV file containing 250 patient records with columns for Age (years), LOS (length of stay, days), Gender (Male/Female), and Department (Emergency, Cardiology, Orthopaedics, General). They want a quick descriptive overview of every variable.

Using Statulator:

1. Open the Dataset Analysis page and click the file-load button to load your CSV. Your data stays in your browser.

2. Click Select Variables. Statulator lists all columns with auto-detected types. Verify that Age and LOS are Numeric, and Gender and Department are Categorical. Adjust if needed, then click Save Changes.

3. Click the UniVariate button.

4. Statulator generates an accordion panel for every selected variable. Expand any panel to see the statistics table and charts.

What you will see:

For Age (numeric): a table showing Mean = 54.2, SD = 16.8, Min = 18, Q1 = 42, Median = 55, Q3 = 67, Max = 93, n = 247, Missing = 3; a histogram of the age distribution; a Q-Q plot for normality assessment; and a box plot highlighting the quartiles and any outliers.

For Department (categorical): a frequency table listing each department with its count and percentage (e.g., Emergency 82, 32.8 %); a bar chart of frequencies; and a pie chart of the proportional breakdown.

Interpretation Guide

Summary Statistics Table

The mean and standard deviation describe the centre and spread of numeric data. Compare the mean to the median: if they are close, the distribution is roughly symmetric; a large gap suggests skewness. Q1 and Q3 bound the middle 50 % of values, and the width of that interval is the interquartile range (IQR = Q3 − Q1).

Histogram

The histogram shows the frequency distribution. Look for overall shape (symmetric, left-skewed, right-skewed), the number of modes (unimodal, bimodal), and any gaps or outlying bars far from the main mass.
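To make the idea of a frequency distribution concrete, the sketch below counts values in equal-width bins. How Statulator chooses its bins is not stated on this page, so the default bin count here (Sturges' rule) and the function name `histogram_counts` are assumptions made for illustration.

```python
import math

def histogram_counts(values, bins=None):
    """Count values in equal-width bins; the last bin includes the maximum."""
    n = len(values)
    if bins is None:
        bins = math.ceil(math.log2(n)) + 1    # Sturges' rule (assumed default)
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0           # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


edges, counts = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], bins=4)
print(edges)    # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)   # [3, 5, 1, 1]
```

In this small made-up sample the single value in the last bin sits well away from the main mass, which is the kind of outlying bar worth a closer look.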

Q-Q Plot

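The quantile-quantile plot compares the quantiles of your data with those of a theoretical normal distribution. Points lying close to the diagonal reference line indicate approximate normality. Systematic departures point to specific problems: a curved, banana-shaped pattern indicates skewness, while an S-shape, where both tails bend away from the line, indicates tails that are heavier or lighter than normal.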
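The points on a normal Q-Q plot can be computed as follows. This is a minimal sketch using only the Python standard library; the plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here, since the page does not say which convention Statulator uses.

```python
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical, observed) pairs for a normal Q-Q plot."""
    observed = sorted(values)
    n = len(observed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n: an assumed convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


for t, x in qq_points([5, 1, 3]):
    print(round(t, 3), x)
```

Plotting the observed values against the theoretical ones gives the Q-Q plot; the closer the points fall to a straight line, the closer the sample is to normal.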

Box Plot

The box spans Q1 to Q3 with the median line inside. Whiskers extend to the most extreme data points within 1.5 × IQR of the box. Individual dots beyond the whiskers are potential outliers that warrant investigation.

Frequency Table & Charts (Categorical)

The frequency table shows how observations are distributed across categories. Check for highly imbalanced groups (e.g., one category containing > 80 % of the data) which may affect downstream analyses. The bar chart makes magnitudes easy to compare; the pie chart highlights relative proportions.

Formulas

Sample Mean

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

Sample Standard Deviation

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

Statulator uses the sample (Bessel-corrected) standard deviation with denominator $n-1$.
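As a quick check of the two formulas, here is a small Python sketch with made-up data. The function names are illustrative; the result is compared with `statistics.stdev`, which also uses the $n-1$ denominator.

```python
import math
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    m = sample_mean(xs)
    # Denominator n - 1 (Bessel's correction), as in the formula above.
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_mean(data))              # 5.0
print(round(sample_sd(data), 3))      # 2.138
print(math.isclose(sample_sd(data), statistics.stdev(data)))  # True
```

Dividing by $n$ instead of $n-1$ would give exactly 2.0 for this sample, so the two conventions are easy to tell apart on small datasets.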

Quartiles

Sort the $n$ observations in ascending order and index them from zero, so that $x_0 \le x_1 \le \dots \le x_{n-1}$. For the quantile at probability $p$, let $L_p = (n-1) \cdot p$.

\[ Q_p = x_{\lfloor L_p \rfloor} + (L_p - \lfloor L_p \rfloor)(x_{\lfloor L_p \rfloor + 1} - x_{\lfloor L_p \rfloor}) \]

$Q_1$: set $p = 0.25$; $Q_2$ (median): $p = 0.50$; $Q_3$: $p = 0.75$
This is the linear-interpolation method (Hyndman & Fan type 7), matching R’s quantile() default, NumPy’s numpy.quantile default, and Excel’s PERCENTILE.INC.
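The interpolation formula translates directly into Python, where list indices start at zero just as the subscripts in $L_p = (n-1) \cdot p$ require. The function name `quantile_type7` is illustrative, and the guard on the upper index is an addition so that $p = 1$ returns the maximum instead of running past the end of the list.

```python
import math

def quantile_type7(values, p):
    """Linear-interpolation quantile (Hyndman & Fan type 7)."""
    xs = sorted(values)
    n = len(xs)
    pos = (n - 1) * p              # L_p in the formula above
    lo = math.floor(pos)
    hi = min(lo + 1, n - 1)        # guard for p = 1
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])


data = [1, 2, 3, 4]
print([quantile_type7(data, p) for p in (0.25, 0.50, 0.75)])  # [1.75, 2.5, 3.25]
```

For comparison, Python's `statistics.quantiles(data, n=4, method="inclusive")` returns the same three values for this sample.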

Interquartile Range

\[ \text{IQR} = Q_3 - Q_1 \]

Outlier fences: lower = $Q_1 - 1.5\,\text{IQR}$, upper = $Q_3 + 1.5\,\text{IQR}$.
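The fences, whisker ends, and outliers can be computed as in the sketch below. The function names are illustrative, and treating a value that lies exactly on a fence as inside (not an outlier) is an assumption.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def whiskers_and_outliers(values, q1, q3):
    """Whisker ends are the most extreme values inside the fences."""
    lower, upper = tukey_fences(q1, q3)
    inside = [v for v in values if lower <= v <= upper]   # fence values count as inside (assumed)
    outliers = [v for v in values if v < lower or v > upper]
    return min(inside), max(inside), outliers


# Quartiles of Age from the worked example: Q1 = 42, Q3 = 67.
print(tukey_fences(42, 67))   # (4.5, 104.5)
```

With these fences, the Age minimum of 18 and maximum of 93 from the worked example both lie inside, so the whiskers would reach the minimum and maximum and no outlier dots would be drawn.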

Relative Frequency (Categorical)

\[ p_j = \frac{n_j}{N} \times 100\% \]

$n_j$ = count in category $j$, $N$ = total non-missing observations.
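A frequency table of this kind can be sketched with `collections.Counter`. The function name and the set of values treated as missing are assumptions; the data reproduce the crop disease counts from the Agriculture example on this page.

```python
from collections import Counter

def frequency_table(values, missing=(None, "")):
    """Map each category to (count, percentage of non-missing observations)."""
    present = [v for v in values if v not in missing]   # missing markers are assumed
    total = len(present)
    counts = Counter(present)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.most_common()}


status = ["Healthy"] * 210 + ["Mild"] * 55 + ["Severe"] * 35
print(frequency_table(status))
# {'Healthy': (210, 70.0), 'Mild': (55, 18.3), 'Severe': (35, 11.7)}
```

Because $N$ counts only non-missing observations, the percentages always sum to 100 % (up to rounding) even when some rows have no value.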

Assumptions & Requirements

Data format: The file must be a comma-separated values (CSV) file. If your data is in Excel, save it as CSV first.
Variable types: Numeric columns should contain only numbers (and optionally missing values). Categorical columns can contain text or numeric codes.
Missing data: Statulator automatically excludes missing values from each calculation and reports the count of missing observations.
Normality (for interpretation): The histogram and Q-Q plot help you assess normality, and the box plot highlights asymmetry and outliers. Normality is an assumption of many parametric tests you may run afterwards; univariate analysis itself does not assume it.
Sample size: There is no strict minimum, but histograms and Q-Q plots become more informative with at least 20–30 observations.
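The missing-data rule above can be illustrated with a short Python sketch that separates usable numbers from missing entries in one column and counts the latter. Which strings Statulator treats as missing is not stated on this page, so the tokens `""`, `"NA"`, and `"NaN"` and the function name `split_missing` are assumptions.

```python
def split_missing(raw_values, missing_tokens=("", "NA", "NaN")):
    """Separate usable numbers from missing entries in one CSV column."""
    numbers, n_missing = [], 0
    for raw in raw_values:
        text = "" if raw is None else str(raw).strip()
        if text in missing_tokens:        # assumed missing-value markers
            n_missing += 1
        else:
            numbers.append(float(text))
    return numbers, n_missing


ages, missing = split_missing(["54", "", "61", "NA", "47"])
print(len(ages), missing)   # 3 2
```

Statistics are then computed on `ages` only, which is how a summary such as n = 247, Missing = 3 arises from 250 records.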

Textbook Examples

Medicine

Emergency department waiting times (minutes) for 150 patients: mean = 42, median = 35, SD = 22, IQR = 25–52, skewness = 1.4.

Analysis: The right skew (mean > median) and skewness of 1.4 indicate a long tail of very long waits. The histogram shows a peak near 30 min with a right tail. Report the median and IQR as the primary summary, since the distribution is not symmetric.

Education

Final exam scores (0–100) for 200 students: mean = 72.4, median = 74, SD = 11.3, min = 28, max = 98.

Analysis: Mean and median are close, suggesting approximate symmetry. The Q-Q plot is roughly linear. The box plot shows two low outliers below 35. These students may need targeted support.

Social Science

Annual household income ($) for 500 respondents: mean = $68,200, median = $52,000, SD = $45,000, IQR = $32k–$78k.

Analysis: The large gap between mean and median signals strong positive skew, typical of income distributions. A histogram would show a long right tail. The median ($52k) is the better central-tendency measure here.

Engineering

Battery life (hours) for 60 smartphones: mean = 11.2, median = 11.0, SD = 1.8, range = 7.1–15.6.

Analysis: Near-symmetric (mean ≈ median). The Q-Q plot and Shapiro-Francia test (p = 0.34) support normality. Reporting mean ± SD (11.2 ± 1.8 hours) is appropriate.

Agriculture

Categorical: crop disease status across 300 plants. Healthy = 210 (70%), Mild = 55 (18.3%), Severe = 35 (11.7%).

Analysis: A frequency table and bar chart show that the majority of plants are healthy. A pie chart illustrates the proportions. The modal category is "Healthy."

