Help: Bivariate Analysis

Bivariate Analysis

Explore relationships between pairs of variables with scatter plots, grouped box plots, and contingency tables.

Overview

Bivariate analysis examines how two variables relate to each other. In Statulator you first designate one variable as the response (outcome) using the variable-selection dialog, then click BiVariate. Statulator pairs the response variable with every other selected variable and automatically chooses the appropriate visualisation and summary based on variable types:

Numeric × Numeric: Scatter plot with trend line and Pearson correlation coefficient.
Numeric × Categorical: Grouped box plots and group-stratified summary statistics (mean, SE, SD, quartiles, min, max per category).
Categorical × Categorical: Contingency table (cross-tabulation) with cell counts, row percentages and column percentages displayed in every cell, plus row totals, column totals, and a grand total.
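
The type-based choice above can be sketched as a small lookup. This is an illustration only, not Statulator's actual code: the function name `choose_display` and the type labels "numeric" and "categorical" are assumptions made for this example.

```python
def choose_display(response_type: str, predictor_type: str) -> str:
    """Return the display described above for a pair of variable types.

    Hypothetical helper: the labels "numeric" and "categorical" are
    assumed names, not identifiers taken from Statulator.
    """
    kinds = {response_type, predictor_type}
    if kinds == {"numeric"}:
        return "scatter plot with trend line and Pearson r"
    if kinds == {"categorical"}:
        return "contingency table"
    if kinds == {"numeric", "categorical"}:
        return "grouped box plots and stratified summary"
    raise ValueError(f"unknown variable types: {response_type}, {predictor_type}")


# FinalScore (numeric) paired with each predictor from the worked example.
for name, kind in [("StudyHours", "numeric"), ("Gender", "categorical")]:
    print(name, "->", choose_display("numeric", kind))
```

Comparing the set of types rather than an ordered pair means the order of the two variables does not matter in this sketch.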

Worked Example

Scenario: Student Performance Dataset

An education researcher has a CSV with 200 student records containing FinalScore (numeric, 0–100), StudyHours (numeric), Gender (Male/Female), and Major (Science, Arts, Business). They want to explore how FinalScore relates to each other variable.

Using Statulator:

1. Open Dataset Analysis and load the CSV.

2. Click Select Variables. Mark FinalScore as the Response variable (tick the Response checkbox). Ensure all other columns are selected with correct types. Click Save Changes.

3. Click the BiVariate button.

4. Statulator creates an accordion panel for each predictor paired with FinalScore.

What you will see:

FinalScore × StudyHours (Num × Num): A scatter plot with a linear trend line and the Pearson r displayed (e.g., r = 0.62). Points above the line represent students who outperformed the trend.

FinalScore × Gender (Num × Cat): Side-by-side box plots for Male and Female, plus a stratified summary table showing each group’s mean, SD, median, Q1, Q3, and n.

FinalScore × Major (Num × Cat with 3 groups): Box plots for Science, Arts, and Business, making it easy to compare distributions across departments.
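
As a rough sketch of what a stratified summary table contains, the snippet below computes n, mean, SD, median, Q1, and Q3 per group with Python's standard library. The scores are made up for illustration, and the "inclusive" quartile method is an assumption; Statulator may use a different quartile definition.

```python
import statistics


def summarise(values):
    """Summary statistics for one group (quartile method is an assumption)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "median": median,
        "q1": q1,
        "q3": q3,
    }


# Hypothetical FinalScore values grouped by Gender.
final_scores = {
    "Male": [55, 60, 65, 70, 75],
    "Female": [62, 68, 74, 80, 86],
}

summary = {group: summarise(values) for group, values in final_scores.items()}
for group, stats in summary.items():
    print(group, stats)
```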

Interpretation Guide

Scatter Plot (Numeric × Numeric)

Look for the overall direction (positive or negative trend), strength (how tightly points cluster around the trend line), and form (linear or curved). The Pearson correlation r ranges from −1 to +1: values near ±1 indicate a strong linear relationship; values near 0 suggest weak or no linear association. Watch for outliers that may inflate or deflate r.
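
To illustrate how a single outlier can inflate r, the sketch below computes the Pearson correlation for five made-up points with little linear association, then again after adding one extreme point. The data and the helper name `pearson_r` are assumptions for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: weak association among the first five points.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# One extreme observation far from the rest dominates the sums.
r_with = pearson_r(xs + [20], ys + [20])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

In this example r rises from roughly 0.14 to roughly 0.97, even though the original five points are unchanged, which is why the scatter plot should be inspected alongside the coefficient.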

Grouped Box Plots (Numeric × Categorical)

Compare the centre (median line) and spread (box height, which is the interquartile range) across groups. Because each box spans the middle 50% of a group’s values, boxes that do not overlap suggest a meaningful group difference, although this is a visual cue rather than a formal test. If one group has a visibly higher median with a narrow box while another is wide and lower, the groups likely differ in both location and variability. Outlier dots in specific groups may warrant further investigation.

Contingency Table (Categorical × Categorical)

Each cell shows the raw count, the row percentage (count divided by the row total), and the column percentage (count divided by the column total). If the row percentages are nearly identical across rows, the two variables are largely independent; large differences suggest an association. The column percentages provide the same view from the other variable’s perspective. Row and column totals help identify imbalanced groups that might affect interpretation.

Group Summary Statistics

The stratified table for Numeric × Categorical includes the standard error (SE) for each group, which reflects the precision of the group mean. When the SE values are small relative to the difference between group means, that difference is less likely to reflect sampling variation alone. The overall row provides the combined statistics for reference.

Formulas

Pearson Correlation Coefficient

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

\(x_i, y_i\) = paired observations; \(\bar{x}, \bar{y}\) = sample means.
Range: \(-1 \leq r \leq 1\).
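
The formula can be translated term by term into code. The following is a minimal sketch using only the standard library; the function name `pearson_r` and the sample values are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, following the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(f"r = {r:.4f}")
```

For these values the numerator is 6 and the two sums of squares are 10 and 6, so r = 6 / sqrt(60), about 0.77.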

Standard Error of the Group Mean

\[ \text{SE}_j = \frac{s_j}{\sqrt{n_j}} \]

\(s_j\) = sample SD in group \(j\); \(n_j\) = number of observations in group \(j\).
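
As a small worked sketch of this formula, the snippet below computes the SE for each group from made-up scores. The helper name `group_se` and the data are assumptions; `statistics.stdev` is used because it returns the sample SD (n − 1 denominator).

```python
import math
import statistics


def group_se(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample SD needs at least two observations")
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical scores per group.
scores_by_group = {
    "Male": [60, 70, 80, 90],
    "Female": [72, 75, 78],
}

se_by_group = {g: group_se(v) for g, v in scores_by_group.items()}
for group, se in se_by_group.items():
    print(f"{group}: SE = {se:.3f}")
```

Both groups here have a mean of 75, but the Female scores are much less spread out, so their SE (about 1.73) is smaller than the Male SE (about 6.45) despite the smaller group size.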

Cross-Tabulation Percentages

Row percentage:

\[ \text{Row}\%_{ij} = \frac{n_{ij}}{n_{i\cdot}} \times 100\% \]

Column percentage:

\[ \text{Col}\%_{ij} = \frac{n_{ij}}{n_{\cdot j}} \times 100\% \]

\(n_{ij}\) = count in row \(i\), column \(j\); \(n_{i\cdot}\) = row total; \(n_{\cdot j}\) = column total.
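
The two percentage formulas can be sketched as follows for a table stored as nested dictionaries. The counts are hypothetical (a Gender × Major table summing to 200), and the function name `crosstab_percentages` is an assumption for this example.

```python
def crosstab_percentages(counts):
    """Return (row_pct, col_pct) for a table given as {row: {col: count}}.

    Assumes every row total and column total is non-zero.
    """
    row_totals = {row: sum(cols.values()) for row, cols in counts.items()}
    col_totals = {}
    for cols in counts.values():
        for col, n in cols.items():
            col_totals[col] = col_totals.get(col, 0) + n

    row_pct = {
        row: {col: 100 * n / row_totals[row] for col, n in cols.items()}
        for row, cols in counts.items()
    }
    col_pct = {
        row: {col: 100 * n / col_totals[col] for col, n in cols.items()}
        for row, cols in counts.items()
    }
    return row_pct, col_pct


# Hypothetical cell counts n_ij.
counts = {
    "Male": {"Science": 30, "Arts": 20, "Business": 50},
    "Female": {"Science": 40, "Arts": 40, "Business": 20},
}

row_pct, col_pct = crosstab_percentages(counts)
print(row_pct["Male"])      # share of each major among males
print(col_pct["Female"])    # share of females within each major
```

Within each row the row percentages sum to 100, and within each column the column percentages sum to 100, which is a quick check that the right total was used as the denominator.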

Outlier Detection (Box Plot)

\[ \text{Lower fence} = Q_1 - 1.5\,\text{IQR}, \quad \text{Upper fence} = Q_3 + 1.5\,\text{IQR} \]

Any observation below the lower fence or above the upper fence is flagged as an outlier.
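
A minimal sketch of the fence rule is shown below. The data are made up, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is an assumption: quartile definitions differ between tools, so Statulator's fences may differ slightly for the same data.

```python
import statistics


def box_plot_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical scores with one unusually high value.
scores = [52, 55, 58, 60, 61, 63, 65, 66, 70, 98]

lower, upper = box_plot_fences(scores)
outliers = [v for v in scores if v < lower or v > upper]

print(f"fences: {lower} to {upper}")
print("flagged:", outliers)
```

Here Q1 = 58.5 and Q3 = 65.75, so the IQR is 7.25 and the fences are 47.625 and 76.625; only the score of 98 falls outside them.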

Assumptions & Requirements

Response variable: You must select exactly one response variable in the variable-selection dialog before clicking BiVariate. This determines which variable appears on the y-axis or as the outcome in each pair.
Variable types: Ensure variables are correctly typed (Numeric or Categorical). Mis-classified variables produce inappropriate visualisations.
Linearity (Num × Num): The Pearson correlation and trend line assume a linear relationship. Inspect the scatter plot for curved patterns; if present, a non-linear model may be more appropriate.
Group sizes (Num × Cat): Grouped box plots and summary statistics are most informative when each category has at least 5–10 observations. Very small groups produce unreliable estimates.
Sparse cells (Cat × Cat): Cross-tabulations with many zero or near-zero cells can be misleading. Consider collapsing rare categories before analysis.
Missing data: Pairs with a missing value in either variable are excluded from that bivariate analysis.
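
The missing-data rule above amounts to pairwise deletion, which can be sketched as follows. Representing a missing value as `None` and the helper name `complete_pairs` are assumptions for this example, not a description of how Statulator stores data.

```python
def complete_pairs(xs, ys):
    """Keep only the pairs where both values are present (not None)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


# Hypothetical columns with gaps in different rows.
final_score = [71, None, 64, 88, 59]
study_hours = [10, 12, None, 15, 7]

pairs = complete_pairs(final_score, study_hours)
print(pairs)
print("rows used:", len(pairs), "of", len(final_score))
```

Because the exclusion is applied per pair, the number of rows used can differ from one bivariate panel to the next within the same dataset.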

Textbook Examples

Medicine

Scatter plot of age vs. systolic blood pressure for 120 patients: Pearson r = 0.58, p < 0.001.

Analysis: The scatter plot shows a clear upward trend. The moderate positive correlation (r = 0.58) indicates that blood pressure tends to increase with age in this sample. Outlier patients (young with high BP) are visible and may warrant clinical review.

Education

Grouped box plot of exam scores by study method (Self-study, Tutor, Online) for 180 students.

Analysis: The Tutor group has the highest median (78) and smallest spread (IQR = 12). The Self-study group shows more variability (IQR = 20) and two low outliers. Visual comparison suggests the Tutor method produces more consistent results.

Social Science

Contingency table of employment status (Employed/Unemployed) by education level (High School/Bachelor/Postgrad) for 400 adults.

Analysis: The percentages computed within each education level (column percentages, if education level forms the columns) show employment rates of 65%, 82%, and 93% for High School, Bachelor, and Postgrad respectively. Plotted as a stacked bar chart, these rates illustrate the positive association between education level and employment.

Agriculture

Scatter plot of rainfall (mm) vs. crop yield (kg/ha) across 50 district-level observations: r = 0.72.

Analysis: Strong positive correlation. The scatter plot shows a roughly linear trend with increased scatter at higher rainfall levels (potential heteroscedasticity). A Spearman correlation (0.68) confirms the monotonic relationship is robust.

