Cumulative differences between paired samples

Isabel Kloumann; Hannah Korevaar; Chris McConnell; Mark Tygert; Jessica Zhao

TL;DR

The paper tackles detecting differences between two paired populations conditioned on an ordinal covariate by introducing a fully nonparametric cumulative framework. It builds graphs of cumulative weighted differences $C_k$ versus abscissae $A_k$ and uses the Kuiper metric $D$ to summarize overall differences, avoiding binning and model-related biases. The authors show that this approach outperforms traditional reliability diagrams and extends naturally to multiple covariates via Hilbert space-filling curves, with statistical significance assessed through a driftless random-walk null and an estimator $\sigma^2$. Applications to synthetic data, the KDD Cup 1998 donor data, and the American Community Survey demonstrate the method’s ability to reveal structured, covariate-specific differences and provide interpretable, robust metrics.

Abstract

The simplest, most common paired samples consist of observations from two populations, with each observed response from one population corresponding to an observed response from the other population at the same value of an ordinal covariate. The pair of observed responses (one from each population) at the same value of the covariate is known as a "matched pair" (with the matching based on the value of the covariate). A graph of cumulative differences between the two populations reveals differences in responses as a function of the covariate. Indeed, the slope of the secant line connecting two points on the graph becomes the average difference over the wide interval of values of the covariate between the two points; i.e., the slope of the graph is the average difference in responses. ("Average" refers to the weighted average if the samples are weighted.) Moreover, a simple statistic known as the Kuiper metric summarizes into a single scalar the overall differences over all values of the covariate. The Kuiper metric is the absolute value of the total difference in responses between the two populations, totaled over the interval of values of the covariate for which the absolute value of the total is greatest. The total should be normalized such that it becomes the (weighted) average over all values of the covariate when the interval over which the total is taken is the entire range of the covariate (i.e., the sum for the total gets divided by the total number of observations, if the samples are unweighted, or divided by the total weight, if the samples are weighted). This cumulative approach is fully nonparametric and uniquely defined (with only one right way to construct the graphs and scalar summary statistics), unlike traditional methods such as reliability diagrams or parametric or semi-parametric regressions, which typically obscure significant differences due to their parameter settings.
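The construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, each matched pair is assumed to carry a single weight, and ties in the covariate are broken by input order. The Kuiper metric is computed as the vertical range of the cumulative graph (including the starting point 0), which matches the description of the greatest absolute total over any interval; the Kolmogorov-Smirnov value is taken to be the greatest absolute deviation from 0.

```python
from itertools import accumulate


def cumulative_differences(scores, resp_a, resp_b, weights=None):
    """Return (abscissae, cumulative), two lists that both start at 0.

    scores  : covariate value of each matched pair (hypothetical name)
    resp_a  : responses from one population
    resp_b  : paired responses from the other population
    weights : optional nonnegative weights; uniform if omitted (assumption)
    """
    n = len(scores)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    # Order the matched pairs by the ordinal covariate.
    order = sorted(range(n), key=lambda i: scores[i])
    # Abscissae: cumulative normalized weight, running from 0 to 1.
    abscissae = [0.0] + list(accumulate(weights[i] / total for i in order))
    # Ordinates: cumulative normalized weighted differences in responses.
    cumulative = [0.0] + list(
        accumulate((resp_a[i] - resp_b[i]) * weights[i] / total for i in order)
    )
    return abscissae, cumulative


def kuiper(cumulative):
    """Vertical range of the cumulative graph."""
    return max(cumulative) - min(cumulative)


def kolmogorov_smirnov(cumulative):
    """Greatest absolute value attained by the cumulative graph."""
    return max(abs(c) for c in cumulative)


# Four matched pairs: population A responds more at low scores, less at high.
A, C = cumulative_differences([0.1, 0.2, 0.3, 0.4], [1, 1, 0, 0], [0, 0, 1, 1])
# C == [0.0, 0.25, 0.5, 0.25, 0.0]; the final value 0.0 is the average difference.
# The secant from (A[0], C[0]) to (A[2], C[2]) has slope 1.0, the average
# difference over the two lowest-scoring pairs.
slope = (C[2] - C[0]) / (A[2] - A[0])
```

In this toy example the overall average difference is 0 while the Kuiper metric is 0.5, which illustrates how the graph and the metric can expose differences that cancel in the overall average.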

Paper Structure (13 sections, 15 equations, 9 figures)

This paper contains 13 sections, 15 equations, 9 figures.

Introduction
Methods
Notation
Graphs
Scalar metrics
Statistical significance
Review of reliability diagrams
Results and discussion
Synthetic examples
KDD Cup 1998
American Community Survey
Conclusion
Review of space-filling curves

Figures (9)

Figure 1: $m =$ 1,000, $n =$ 4,000; Kuiper's statistic is $0.09816 / \sigma = 7.784$, Kolmogorov's and Smirnov's is $0.09403 / \sigma = 7.456$. The cumulative graphs clearly reveal a narrow range of scores (right around the median score) that is flat, that is, where there is little difference in the responses between the populations. In contrast, discerning the lack of difference in responses between the populations in this narrow range is very challenging using only the reliability diagrams --- the cumulative graphs are far clearer. The weighted average difference in responses is the vertical coordinate $C_m$ at the greatest (rightmost) score $S_m$ in the cumulative plots, which is clearly much smaller (closer to 0) than the full vertical range of the graph; the vertical range ($\max_{0 \le j \le m} C_j - \min_{0 \le j \le m} C_j$) is the value of the Kuiper metric. Overall, the empirical cumulative graph closely matches the exact expected ground-truth cumulative graph.
Figure 2: $m =$ 4,000, $n =$ 4,000; Kuiper's statistic is $0.1686 / \sigma = 16.20$, Kolmogorov's and Smirnov's is $0.1571 / \sigma = 15.09$. The cumulative graphs have fairly clear kinks at the scores where the reliability diagrams ideally would jump discontinuously in order to match the lowermost plot. However, detecting discontinuous jumps in the reliability diagrams based on the actual random observations is very hard. When the bins are narrow enough to resolve the jumps, the noise on the weighted average response in each bin is large, with the averages jumping all over due to the noise (in addition to the jumps in the underlying expected distribution depicted in the lowermost plot). For the most part, the cumulative graph based on random observations closely matches the exact expected ground-truth cumulative graph.
Figure 3: $m =$ 400, $n =$ 4,000; Kuiper's statistic is $0.04235 / \sigma = 3.676$, Kolmogorov's and Smirnov's is $0.02328 / \sigma = 2.021$. Recall that the weighted average difference in responses is the vertical coordinate $C_m$ at the greatest (rightmost) score $S_m$ in the cumulative plots. This value is significantly smaller, that is, nearer to 0, than the vertical range of the graph; the vertical range ($\max_{0 \le j \le m} C_j - \min_{0 \le j \le m} C_j$) is the value of the Kuiper statistic. Whereas any one of the empirical reliability diagrams obscures at least some aspect of the exact ground-truth diagram displayed in the lowermost plot, the empirical cumulative graph closely resembles the exact expected ground-truth cumulative graph.
Figure 4: $m = n =$ 63,826; Kuiper's statistic is $0.01354 / \sigma = 2.514$, Kolmogorov's and Smirnov's is $0.01245 / \sigma = 2.312$. Notice how these scalar statistics miss the very steep slopes in the cumulative graph, as the graph oscillates rapidly. The following figure, Figure 5, fixes this failure by reversing the order of the covariates in the parameterization via the Hilbert curve. The present figure and the following figure display similarly high slopes, but re-ordering the scores as in the following figure prevents the oscillation that the scalar metrics cannot capture.
Figure 5: $m = n =$ 63,826; Kuiper's statistic is $0.02246 / \sigma = 4.170$, Kolmogorov's and Smirnov's is $0.01813 / \sigma = 3.366$. This figure is the same as the previous figure, Figure 4, but with the order of the covariates reversed in the parameterization by the Hilbert curve. Both the present figure and the former figure reveal similarly steep slopes; the present figure reduces oscillations in the cumulative graph in comparison to the previous figure. The scalar statistics reflect the steeper slopes better in the present figure, due to the reduction in cancellation from oscillation.
...and 4 more figures
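
The captions above report each scalar statistic both in raw form and divided by $\sigma$. As a rough consistency check, the short Python sketch below backs out the value of $\sigma$ implied by each caption. It assumes that "$x / \sigma = y$" means the raw statistic $x$ divided by $\sigma$ equals $y$; the numbers are copied from the captions of Figures 1 to 5, while the variable and function names are hypothetical.

```python
import math

# (figure, Kuiper value, Kuiper / sigma, Kolmogorov-Smirnov value, K-S / sigma),
# copied from the captions of Figures 1-5.
CAPTIONS = [
    (1, 0.09816, 7.784, 0.09403, 7.456),
    (2, 0.1686, 16.20, 0.1571, 15.09),
    (3, 0.04235, 3.676, 0.02328, 2.021),
    (4, 0.01354, 2.514, 0.01245, 2.312),
    (5, 0.02246, 4.170, 0.01813, 3.366),
]


def implied_sigma(statistic, ratio):
    """Solve statistic / sigma = ratio for sigma (assumed reading of the captions)."""
    return statistic / ratio


# For each figure: sigma implied by the Kuiper pair and by the K-S pair.
IMPLIED = {
    fig: (implied_sigma(d, d_ratio), implied_sigma(ks, ks_ratio))
    for fig, d, d_ratio, ks, ks_ratio in CAPTIONS
}

for fig, (from_kuiper, from_ks) in IMPLIED.items():
    agree = math.isclose(from_kuiper, from_ks, rel_tol=1e-3)
    print(f"Figure {fig}: sigma ~ {from_kuiper:.5f} (Kuiper), "
          f"{from_ks:.5f} (K-S), consistent: {agree}")
```

Within each caption the two implied values agree to within rounding, and Figures 4 and 5 imply nearly the same $\sigma$ (about 0.0054). That would be expected if $\sigma$ does not depend on the ordering of the scores, although the captions themselves do not state this.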
