Num × Num — Linear Regression & Correlation

Detailed analysis modal for the relationship between two numeric (continuous) variables, including Pearson and Spearman correlations, simple linear regression, and regression diagnostics.

Overview

When you click a Num × Num cell in the association matrix, Statulator opens a modal with four tabs: Statistics, Graph, Diagnostics, and Formulas.

You can download the complete analysis (all four tabs) as a PDF report or export the scatter plot as a PNG image using the buttons in the modal footer.

Worked Example

Scenario: Blood Pressure & Age

A researcher has a dataset of 500 adults with Age (years) and SystolicBP (mmHg). They want to know whether older age is associated with higher blood pressure.

Steps in Statulator:

1. Load the CSV on the Dataset Analysis page.

2. Click Select Variables, confirm both variables are detected as Numeric, and save.

3. Click Stat Analysis to generate the 2 × 2 association matrix.

4. Click the cell at the intersection of Age and SystolicBP.

What you will see:

The Statistics tab shows Pearson r = 0.54 (95 % CI: 0.47, 0.60; p < 0.001) and Spearman ρ = 0.51, confirming a moderate positive correlation. The regression table shows SystolicBP = 95.2 + 0.63 × Age, with R² = 0.29 — meaning age explains about 29 % of the variance in systolic blood pressure. The Shapiro-Francia test on residuals may fail (common in large samples), suggesting a note of caution about parametric inference.

The Graph tab displays the scatter plot with the regression line and 95 % confidence band, showing the upward trend clearly. The Diagnostics tab reveals whether residuals are well-behaved (no obvious patterns in the Residuals vs Fitted plot, points close to the diagonal in the Q-Q plot).

Interpretation Guide

Correlation Coefficients

Pearson r measures linear association between two variables, ranging from −1 (perfect negative) to +1 (perfect positive). Spearman ρ measures monotonic association based on ranks and is more robust to outliers and non-linear relationships. If the two values diverge substantially, the relationship may be non-linear.

Common strength labels: |r| < 0.3 = weak, 0.3 ≤ |r| ≤ 0.7 = moderate, |r| > 0.7 = strong. Treat these as rough conventions rather than hard thresholds.

Regression Table

The intercept (β0) represents the predicted value of the response variable when the predictor is zero. The slope (β1) represents the expected change in the response for each one-unit increase in the predictor. Both estimates come with standard errors, t-statistics, and 95 % confidence intervals. A significant p-value for the slope indicates a statistically significant linear relationship.
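For instance, using the fitted equation from the worked example above, the predicted systolic blood pressure for a hypothetical 60-year-old (an illustrative value, not from the dataset) is

\[ \widehat{\text{SystolicBP}} = 95.2 + 0.63 \times 60 \approx 133\ \text{mmHg}. \]

Note that the intercept (95.2 mmHg at age zero) lies well outside the observed age range and should not be interpreted literally.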

Model Fit

R² (coefficient of determination) indicates the proportion of variance in the response explained by the predictor. Adjusted R² penalises for the number of predictors; in simple regression it sits slightly below R², with the gap vanishing as the sample grows. The F-statistic tests whether the model fits significantly better than an intercept-only model.
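For reference, with \(n\) observations and \(p\) predictors (here \(p = 1\)):

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

so in simple regression the adjustment only shrinks \(1 - R^2\) by the factor \((n-1)/(n-2)\).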

Assumption Check: Shapiro-Francia

This tests whether the regression residuals are normally distributed. A Pass (p > 0.05) means normality is not rejected. A Fail (p ≤ 0.05) suggests the residuals deviate from normality; consider the Spearman correlation as a non-parametric alternative or inspect the diagnostic plots for the nature of the departure.

Diagnostic Plots

Residuals vs Fitted: Look for a random scatter around zero. A curved pattern suggests non-linearity; a funnel shape suggests heteroscedasticity.

Normal Q-Q: Points should lie close to the diagonal. Systematic departures in the tails indicate non-normal residuals (heavy tails, skewness).

Scale-Location: The square root of standardised residuals should show no trend. An upward slope suggests increasing variance (heteroscedasticity).

Residuals vs Leverage: Identifies influential observations. Points with high leverage and large residuals (near Cook’s distance contours) may disproportionately affect the regression line.
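For readers who want to reproduce these four plots outside Statulator, here is a minimal sketch in Python using statsmodels and matplotlib. The simulated Age/SystolicBP data and all plotting choices are illustrative assumptions, not Statulator's internals:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Illustrative data shaped like the worked example (assumption, not real data)
rng = np.random.default_rng(42)
age = rng.uniform(20, 80, 500)
sbp = 95.2 + 0.63 * age + rng.normal(0, 10, 500)

model = sm.OLS(sbp, sm.add_constant(age)).fit()
infl = model.get_influence()
std_resid = infl.resid_studentized_internal

fig, ax = plt.subplots(2, 2, figsize=(10, 8))

# Residuals vs Fitted: look for random scatter around zero
ax[0, 0].scatter(model.fittedvalues, model.resid, s=10)
ax[0, 0].axhline(0, color="grey", linestyle="--")
ax[0, 0].set(title="Residuals vs Fitted", xlabel="Fitted values",
             ylabel="Residuals")

# Normal Q-Q: points should lie close to the diagonal
sm.qqplot(model.resid, line="s", ax=ax[0, 1])
ax[0, 1].set_title("Normal Q-Q")

# Scale-Location: no trend means homoscedastic residuals
ax[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(std_resid)), s=10)
ax[1, 0].set(title="Scale-Location", xlabel="Fitted values",
             ylabel="sqrt(|standardised residuals|)")

# Residuals vs Leverage: flags influential observations (cf. Cook's distance)
ax[1, 1].scatter(infl.hat_matrix_diag, std_resid, s=10)
ax[1, 1].set(title="Residuals vs Leverage", xlabel="Leverage",
             ylabel="Standardised residuals")

fig.tight_layout()
plt.show()
```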

Formulas

Pearson Correlation & Confidence Interval
\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2\,\sum(y_i - \bar{y})^2}} \]

95 % CI via Fisher z-transform:

\[ z = \tfrac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right),\quad \text{SE}_z = \frac{1}{\sqrt{n-3}} \] \[ z_{\text{lower,upper}} = z \pm 1.96\,\text{SE}_z, \quad r_{\text{lower,upper}} = \frac{e^{2z_*}-1}{e^{2z_*}+1} \]
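As a concrete illustration, the following Python sketch computes Pearson r and its Fisher-z interval exactly as above (np.arctanh and np.tanh implement the z-transform and its inverse). It is an independent re-implementation, not Statulator's own code:

```python
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, conf=0.95):
    """Pearson r with a Fisher z-transform confidence interval."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r, p = stats.pearsonr(x, y)
    z = np.arctanh(r)                       # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)
    crit = stats.norm.ppf(0.5 + conf / 2)   # 1.96 for a 95 % interval
    lo, hi = np.tanh([z - crit * se, z + crit * se])  # back-transform bounds
    return r, (lo, hi), p
```

On data like the worked example's, this would return output of the form r = 0.54 with a 95 % CI of (0.47, 0.60).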
Spearman Rank Correlation
\[ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \]
  • \(d_i\) = difference in ranks for observation \(i\).
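A minimal Python sketch of the rank-difference formula follows. Note that the closed form is exact only without ties; with tied values, prefer scipy.stats.spearmanr, which computes the Pearson correlation of the (average) ranks:

```python
import numpy as np
from scipy import stats

def spearman_rho(x, y):
    """Spearman rho via the rank-difference formula (exact only without ties)."""
    rx, ry = stats.rankdata(x), stats.rankdata(y)
    d = rx - ry                              # d_i: rank difference per observation
    n = len(rx)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```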
Simple Linear Regression (OLS)
\[ \hat{y} = b_0 + b_1 x, \quad b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad b_0 = \bar{y} - b_1\bar{x} \] \[ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}, \quad F = \frac{\text{MSR}}{\text{MSE}} \]
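The closed-form estimates translate directly into code. The sketch below is an illustrative re-implementation (not Statulator's) returning the intercept, slope, R², and the overall F test with its p-value:

```python
import numpy as np
from scipy import stats

def simple_ols(x, y):
    """Closed-form simple linear regression: intercept, slope, R^2, overall F."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)   # residual sum of squares (SSE)
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares (SST)
    r2 = 1 - sse / sst
    f = (sst - sse) / (sse / (n - 2))        # F = MSR / MSE, df = (1, n - 2)
    return b0, b1, r2, f, stats.f.sf(f, 1, n - 2)
```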
Shapiro-Francia Normality Test
\[ W' = \frac{\left(\sum_{i=1}^{n}(x_{(i)} - \bar{x})(m_i - \bar{m})\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2\;\sum_{i=1}^{n}(m_i - \bar{m})^2} \]
  • \(x_{(i)}\) = \(i\)-th order statistic;   \(m_i = \Phi^{-1}\!\bigl((i - 3/8)/(n + 1/4)\bigr)\) = expected normal score (Blom).
  • \(W'\) is the squared Pearson correlation between the sorted data and the expected normal scores.
  • P-value via Royston (1993): with \(u = \log n\), \(\nu_1 = \log u - u\), \(\nu_2 = \log u + 2/u\), \(\mu = -1.2725 + 1.0521\,\nu_1\), \(\sigma = 1.0308 - 0.26758\,\nu_2\), \(z = (\log(1 - W') - \mu)/\sigma\), \(p = 1 - \Phi(z)\).
  • Shapiro-Francia is closely related to Shapiro-Wilk but correlates the sorted data with the expected normal scores directly, rather than using the optimal weights derived from the covariance matrix of normal order statistics (approximated in Royston 1992). The two tests agree very closely; SF is preferred here for its simpler implementation.
  • For \(n > 5000\), Statulator falls back to the D’Agostino-Pearson omnibus test.
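Putting the pieces above together, a compact Python re-implementation of the test and the Royston (1993) p-value (illustrative only; Statulator's internals may differ):

```python
import numpy as np
from scipy import stats

def shapiro_francia(x):
    """Shapiro-Francia W' with the Royston (1993) p-value approximation."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    m = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))  # Blom scores
    w = np.corrcoef(x, m)[0, 1] ** 2    # squared correlation with normal scores
    u = np.log(n)
    mu = -1.2725 + 1.0521 * (np.log(u) - u)
    sigma = 1.0308 - 0.26758 * (np.log(u) + 2 / u)
    z = (np.log(1 - w) - mu) / sigma
    return w, stats.norm.sf(z)          # p = 1 - Phi(z), upper tail
```

The approximation is intended for moderate sample sizes, consistent with the n > 5000 fallback noted above.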

Assumptions & Requirements

  • Both variables are numeric (continuous); purely ordinal data are better summarised by Spearman ρ alone.
  • Linearity: the relationship should be approximately linear for Pearson r and the OLS fit to be meaningful.
  • Independence: observations should be independent of one another.
  • Homoscedasticity: residual variance should be roughly constant across fitted values (check the Scale-Location plot).
  • Normality of residuals: needed for valid t- and F-based inference; assessed via the Shapiro-Francia test and the Q-Q plot.
  • Outliers and influential points can distort both the correlation and the regression line (check the Residuals vs Leverage plot).

Textbook Examples

Medicine

Relationship between patient age and serum cholesterol in 800 adults.

Results: Pearson r = 0.38 (moderate positive), regression: Cholesterol = 152 + 0.72 × Age, R² = 0.14. Shapiro-Francia: Pass (W = 0.998, p = 0.22).
Interpretation: Age explains about 14 % of cholesterol variation. The relationship is statistically significant but modest in predictive power.

Education

Study hours per week vs final exam score for 200 university students.

Results: Pearson r = 0.63, Spearman ρ = 0.59, R² = 0.40. Residual diagnostics show mild heteroscedasticity in the Scale-Location plot.
Interpretation: Study hours have a strong positive association with exam scores, explaining 40 % of the variance. The slight heteroscedasticity warrants caution but does not invalidate the main conclusion.

Agriculture

Rainfall (mm) vs crop yield (tonnes/ha) across 120 farm plots.

Results: Pearson r = 0.71, R² = 0.50. Residuals vs Fitted plot shows slight curvature, suggesting a possible non-linear component. Spearman ρ = 0.74.
Interpretation: A strong association between rainfall and yield. The higher Spearman ρ and curved residual pattern suggest a non-linear model might fit better.

Social Science

Years of education vs annual income for 1,500 survey respondents.

Results: Pearson r = 0.52, Spearman ρ = 0.56, R² = 0.27. Shapiro-Francia: Fail (large sample), Q-Q plot shows right skew in residuals.
Interpretation: Education is moderately associated with income. The Shapiro-Francia failure in large samples is common and does not necessarily invalidate the regression; the Q-Q plot suggests income might benefit from a log transformation.

References

  1. Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.
  2. Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15(1), 72–101.
  3. Fisher, R. A. (1921). On the “probable error” of a coefficient of correlation deduced from a small sample. Metron, 1, 3–32.
  4. Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality. Biometrika, 52(3/4), 591–611.
  5. Royston, P. (1992). Approximating the Shapiro-Wilk W-test for non-normality. Statistics and Computing, 2(3), 117–119.
  6. Royston, P. (1993). A pocket-calculator algorithm for the Shapiro-Francia test for non-normality: An application to medicine. Statistics in Medicine, 12(2), 181–184.
  7. Royston, P. (1995). Remark AS R94: A remark on algorithm AS 181: The W-test for normality. Applied Statistics, 44(4), 547–551.
  8. Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18.
  9. Draper, N. R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley.