Bivariate Analysis

Explore relationships between pairs of variables with scatter plots, grouped box plots, and contingency tables.

Overview

Bivariate analysis examines how two variables relate to each other. In Statulator you first designate one variable as the response (outcome) in the variable-selection dialog, then click BiVariate. Statulator pairs the response variable with every other selected variable and chooses the visualisation and summary from the two variable types: a scatter plot with Pearson r for numeric × numeric pairs, grouped box plots with a stratified summary table for numeric × categorical pairs, and a contingency table for categorical × categorical pairs.
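The type-based choice can be pictured with a short Python sketch. This is only an illustration of the rule described on this page, not Statulator's actual code: the function name `choose_display` and the type labels "numeric" and "categorical" are assumptions, and the sketch treats a mixed pair the same way regardless of which variable is the response.

```python
def choose_display(response_type: str, predictor_type: str) -> str:
    """Pick a display for a pair of variable types (hypothetical helper)."""
    kinds = {response_type, predictor_type}
    if not kinds <= {"numeric", "categorical"}:
        raise ValueError(f"unknown variable type in {sorted(kinds)}")
    if kinds == {"numeric"}:
        return "scatter plot with Pearson r"
    if kinds == {"categorical"}:
        return "contingency table"
    # One numeric and one categorical variable.
    return "grouped box plots with stratified summary"


# Predictors from the worked example, paired with the numeric response FinalScore.
predictors = {"StudyHours": "numeric", "Gender": "categorical", "Major": "categorical"}
for name, kind in predictors.items():
    print(f"FinalScore x {name}: {choose_display('numeric', kind)}")
```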

Worked Example

Scenario: Student Performance Dataset

An education researcher has a CSV with 200 student records containing FinalScore (numeric, 0–100), StudyHours (numeric), Gender (Male/Female), and Major (Science, Arts, Business). They want to explore how FinalScore relates to each of the other variables.

Using Statulator:

1. Open Dataset Analysis and load the CSV.

2. Click Select Variables. Mark FinalScore as the Response variable (tick the Response checkbox). Ensure all other columns are selected with the correct types. Click Save Changes.

3. Click the BiVariate button.

4. Statulator creates an accordion panel for each predictor paired with FinalScore.

What you will see:

FinalScore × StudyHours (Num × Num): A scatter plot with a linear trend line and the Pearson r displayed (e.g., r = 0.62). Points above the line are students who scored higher than the trend predicts for their study hours.

FinalScore × Gender (Num × Cat): Side-by-side box plots for Male and Female, plus a stratified summary table showing each group’s mean, SD, median, Q1, Q3, and n.

FinalScore × Major (Num × Cat with 3 groups): Box plots for Science, Arts, and Business, making it easy to compare score distributions across majors.

Interpretation Guide

Scatter Plot (Numeric × Numeric)

Look for the overall direction (positive or negative trend), strength (how tightly points cluster around the trend line), and form (linear or curved). The Pearson correlation r ranges from −1 to +1: values near ±1 indicate a strong linear relationship; values near 0 suggest weak or no linear association. Watch for outliers that may inflate or deflate r.
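The sensitivity of r to outliers can be checked numerically. The sketch below is a minimal illustration with made-up numbers (the `hours` and `scores` lists are hypothetical, not the worked-example dataset), and `pearson_r` is a hand-written helper rather than anything Statulator exposes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: five students on a near-perfect upward trend.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]
r_clean = pearson_r(hours, scores)

# The same data plus one student who studied most but scored lowest.
r_outlier = pearson_r(hours + [6], scores + [30])

print(round(r_clean, 3), round(r_outlier, 3))
```

In this toy case a single extreme point moves r from roughly 0.996 to roughly −0.22, which is why it is worth inspecting the scatter plot before trusting the coefficient.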

Grouped Box Plots (Numeric × Categorical)

Compare the centre (median line) and spread (box height) across groups. Non-overlapping boxes suggest a meaningful group difference. If one group has a visibly higher median with a narrow box while another is wide and lower, the groups likely differ in both location and variability. Outlier dots in specific groups may warrant further investigation.

Contingency Table (Categorical × Categorical)

Each cell shows the raw count, the row percentage (count divided by the row total), and the column percentage (count divided by the column total). If the row percentages are nearly identical across rows, the two variables are largely independent; large differences suggest an association. The column percentages provide the same view from the other variable’s perspective. Row and column totals help identify imbalanced groups that might affect interpretation.
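As a rough illustration of how the three cell values are derived, the following Python sketch builds a small cross-tabulation from two lists of labels. The helper name `crosstab` and the eight records are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count, row %, and column % for each observed (row, column) pair."""
    counts = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    table = {}
    for (r, c), n in counts.items():
        table[(r, c)] = {
            "count": n,
            "row_pct": 100 * n / row_totals[r],  # share of the row total
            "col_pct": 100 * n / col_totals[c],  # share of the column total
        }
    # Combinations that never occur are simply absent (count 0).
    return table, row_totals, col_totals


# Hypothetical records: Gender as the row variable, Major as the column variable.
gender = ["Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male"]
major = ["Science", "Arts", "Science", "Arts", "Arts", "Business", "Science", "Business"]

table, row_totals, col_totals = crosstab(gender, major)
for (g, m), cell in sorted(table.items()):
    print(g, m, cell["count"], round(cell["row_pct"], 1), round(cell["col_pct"], 1))
```

For the Male × Science cell, 2 of the 4 males chose Science (row percentage 50%), while 2 of the 3 Science students are male (column percentage about 66.7%).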

Group Summary Statistics

The stratified table for Numeric × Categorical includes the standard error (SE) for each group, which reflects the precision of the group mean. Smaller SE values (relative to the difference in means) suggest the observed difference is likely reliable. The overall row provides the combined statistics for reference.
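A stratified summary of this kind can be approximated in a few lines of Python. The sketch below is an assumption-laden illustration: `stratified_summary` is a hypothetical helper, the scores are invented, and quartiles are left out because quartile conventions differ between tools.

```python
import math
import statistics
from collections import defaultdict


def stratified_summary(values, groups):
    """Per-group n, mean, SD, SE, and median, plus an 'Overall' row."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    rows = dict(by_group)
    rows["Overall"] = list(values)  # assumes no group is itself named "Overall"

    summary = {}
    for name, data in rows.items():
        n = len(data)
        # Sample SD (n - 1 denominator); undefined for a single observation.
        sd = statistics.stdev(data) if n > 1 else float("nan")
        summary[name] = {
            "n": n,
            "mean": statistics.mean(data),
            "sd": sd,
            "se": sd / math.sqrt(n),
            "median": statistics.median(data),
        }
    return summary


# Hypothetical exam scores for two study methods.
scores = [70, 74, 78, 60, 65, 85]
method = ["Tutor", "Tutor", "Tutor", "Self-study", "Self-study", "Self-study"]

summary = stratified_summary(scores, method)
for name, row in summary.items():
    print(name, {key: round(val, 2) for key, val in row.items()})
```

In this toy data the group means differ by 4 points (74 versus 70), but the Self-study SE is about 7.6, so the difference is small relative to the uncertainty in the means.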

Formulas

Pearson Correlation Coefficient
\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]
  • \(x_i, y_i\) = paired observations;   \(\bar{x}, \bar{y}\) = sample means.
  • Range: \(-1 \leq r \leq 1\).
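
As a small worked example (illustrative numbers, not taken from any dataset on this page), take the three pairs \((1,1), (2,3), (3,2)\), so that \(\bar{x} = 2\) and \(\bar{y} = 2\):

\[ \sum_{i=1}^{3}(x_i - \bar{x})(y_i - \bar{y}) = (-1)(-1) + (0)(1) + (1)(0) = 1 \]

\[ \sum_{i=1}^{3}(x_i - \bar{x})^2 = 1 + 0 + 1 = 2, \qquad \sum_{i=1}^{3}(y_i - \bar{y})^2 = 1 + 1 + 0 = 2 \]

\[ r = \frac{1}{\sqrt{2 \times 2}} = 0.5 \]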

Standard Error of the Group Mean
\[ \text{SE}_j = \frac{s_j}{\sqrt{n_j}} \]
  • \(s_j\) = sample SD in group \(j\);   \(n_j\) = number of observations in group \(j\).
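
For instance, with illustrative values \(s_j = 12\) and \(n_j = 36\):

\[ \text{SE}_j = \frac{12}{\sqrt{36}} = 2 \]

Because \(n_j\) enters through a square root, quadrupling the group size to \(n_j = 144\) with the same SD only halves the standard error:

\[ \text{SE}_j = \frac{12}{\sqrt{144}} = 1 \]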

Cross-Tabulation Percentages

Row percentage:

\[ \text{Row}\%_{ij} = \frac{n_{ij}}{n_{i\cdot}} \times 100\% \]

Column percentage:

\[ \text{Col}\%_{ij} = \frac{n_{ij}}{n_{\cdot j}} \times 100\% \]
  • \(n_{ij}\) = count in row \(i\), column \(j\);   \(n_{i\cdot}\) = row total;   \(n_{\cdot j}\) = column total.

Outlier Detection (Box Plot)
\[ \text{Lower fence} = Q_1 - 1.5\,\text{IQR}, \quad \text{Upper fence} = Q_3 + 1.5\,\text{IQR} \]

Any observation below the lower fence or above the upper fence is flagged as an outlier.
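
The fence rule can be sketched in Python as follows. Quartile definitions vary between tools, and this page does not state which one Statulator uses, so the "inclusive" method of `statistics.quantiles` is an assumption here; `tukey_fences`, `flag_outliers`, and the scores are likewise hypothetical.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles by linear interpolation over the sample ("inclusive").
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(data, k=1.5):
    """Observations strictly below the lower fence or above the upper fence."""
    lower, upper = tukey_fences(data, k)
    return [x for x in data if x < lower or x > upper]


# Hypothetical exam scores.
scores = [55, 61, 64, 66, 68, 70, 71, 73, 75, 98]

print(tukey_fences(scores))
print(flag_outliers(scores))
```

With these numbers Q1 = 64.5 and Q3 = 72.5, so the fences are 52.5 and 84.5: the score of 98 is flagged, while 55 stays inside the lower fence.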

Assumptions & Requirements

Textbook Examples

Medicine

Scatter plot of age vs. systolic blood pressure for 120 patients: Pearson r = 0.58, p < 0.001.

Analysis: The scatter plot shows a clear upward trend. The moderate positive correlation (r = 0.58) confirms that blood pressure tends to increase with age. Outlier patients (young with high BP) are visible and may warrant clinical review.

Education

Grouped box plot of exam scores by study method (Self-study, Tutor, Online) for 180 students.

Analysis: The Tutor group has the highest median (78) and smallest spread (IQR = 12). The Self-study group shows more variability (IQR = 20) and two low outliers. Visual comparison suggests the Tutor method produces more consistent results.

Social Science

Contingency table of employment status (Employed/Unemployed) by education level (High School/Bachelor/Postgrad) for 400 adults.

Analysis: With education level as the row variable, the row percentages give employment rates of 65%, 82%, and 93% for High School, Bachelor, and Postgrad respectively. A stacked bar chart clearly illustrates the positive association between education level and employment.

Agriculture

Scatter plot of rainfall (mm) vs. crop yield (kg/ha) across 50 district-level observations: r = 0.72.

Analysis: The correlation is strong and positive. The scatter plot shows a roughly linear trend with increased scatter at higher rainfall levels (potential heteroscedasticity). A Spearman correlation of 0.68, close to the Pearson value, suggests the monotonic relationship is robust and not driven by a few extreme observations.
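
The Spearman coefficient mentioned above is the Pearson correlation computed on ranks. The sketch below shows that idea under stated assumptions: the helper names are hypothetical, ties receive average ranks, and the six rainfall and yield values are invented for illustration (they are not the 50 district-level observations).

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie block i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical rainfall (mm) and yield (kg/ha) values.
rainfall = [300, 450, 500, 620, 700, 810]
yield_kg = [1100, 1500, 1450, 2100, 2600, 4000]

print(round(pearson_r(rainfall, yield_kg), 3), round(spearman_rho(rainfall, yield_kg), 3))
```

Because ranks ignore how far apart the values are, a single very large yield affects the Spearman coefficient less than it affects Pearson r.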
