Early Time Classification with Accumulated Accuracy Gap Control

Liran Ringel; Regev Cohen; Daniel Freedman; Michael Elad; Yaniv Romano

TL;DR

This work tackles the problem of labeling data streams as early as possible without sacrificing accuracy by introducing calibrated stopping rules for early time classification with finite-sample, distribution-free guarantees. It develops two risk-control paradigms, marginal and conditional; the latter gives a stronger, halt-time-aware guarantee by controlling the accuracy gap conditionally on the accumulated halt times. The authors use a two-stage calibration framework (Stage 1: candidate screening; Stage 2: testing) grounded in Learn-then-Test and fixed-sequence testing, which handles large hyperparameter spaces while controlling the family-wise error rate (FWER). Empirical results on structured datasets and an NLP reading-comprehension task show that conditional risk control can substantially reduce computation while maintaining the accuracy gap, avoiding up to 94% of timesteps in some settings. The framework offers practical, statistically justified early-exit mechanisms for sequential classifiers, applicable to real-time inference and resource-constrained deployment.

Abstract

Early time classification algorithms aim to label a stream of features without processing the full input stream, while maintaining accuracy comparable to that achieved by applying the classifier to the entire input. In this paper, we introduce a statistical framework that can be applied to any sequential classifier, formulating a calibrated stopping rule. This data-driven rule attains finite-sample, distribution-free control of the accuracy gap between full and early-time classification. We start by presenting a novel method that builds on the Learn-then-Test calibration framework to control this gap marginally, on average over i.i.d. instances. As this algorithm tends to yield an excessively high accuracy gap for early halt times, our main contribution is the proposal of a framework that controls a stronger notion of error, where the accuracy gap is controlled conditionally on the accumulated halt times. Numerical experiments demonstrate the effectiveness, applicability, and usefulness of our method. We show that our proposed early stopping mechanism reduces up to 94% of timesteps used for classification while achieving rigorous accuracy gap control.
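To make the abstract's two central objects concrete, the following is a minimal sketch of a confidence-threshold stopping rule and of an empirical accuracy gap. It is an illustration under stated assumptions, not the authors' implementation: the function names `halt_time`, `gap_loss`, and `empirical_gap` are hypothetical, the rule "halt at the first timestep whose confidence reaches a threshold" is one simple choice of stopping rule, and clipping the per-sample gap at zero is an assumption of this sketch.

```python
def halt_time(confidences, lam):
    """Return the first 1-indexed timestep whose confidence is >= lam.

    If no timestep reaches the threshold, the whole stream is processed
    and the last timestep is returned. (Hypothetical stopping rule.)
    """
    for t, c in enumerate(confidences, start=1):
        if c >= lam:
            return t
    return len(confidences)


def gap_loss(full_correct, early_correct):
    """Per-sample accuracy gap between full and early classification.

    Assumption of this sketch: the gap is clipped at zero, so it equals 1
    only when the full-input prediction is correct and the early one is not.
    """
    return max(0, int(full_correct) - int(early_correct))


def empirical_gap(full_flags, early_flags):
    """Average per-sample gap over a set of labeled streams."""
    losses = [gap_loss(f, e) for f, e in zip(full_flags, early_flags)]
    return sum(losses) / len(losses)


# Toy usage: one stream's per-timestep confidences and four labeled samples.
confidences = [0.4, 0.6, 0.95, 0.99]
print(halt_time(confidences, 0.9))   # halts at timestep 3
print(empirical_gap([True, True, False, True],
                    [True, False, True, True]))  # 0.25
```

A lower threshold halts earlier but typically widens the gap; the calibration procedures summarized above choose the threshold so that this trade-off stays within a user-specified tolerance.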

Paper Structure (21 sections, 2 theorems, 10 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
A Motivating Example: Reading Comprehension
Preview of our methods
Related Work
Warm-up: Marginal Accuracy Gap Control
Conditional Accuracy Gap Control
Stage 1: Candidate Screening:
Stage 2: Testing:
Stage 1: Candidate Screening
Stage 2: Testing
Experiments
Application to Structured Data
An NLP Application
Conclusion
Marginal Risk Control Algorithm
...and 6 more sections

Key Result

Proposition 1

Assuming the calibration and test samples are i.i.d., with $\hat{\lambda}$ selected as outlined in the marginal risk control algorithm, the stopping rule $\tau_{\hat{\lambda}}(X)$ satisfies the marginal accuracy gap guarantee.
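The selection of $\hat{\lambda}$ can be illustrated with a generic Learn-then-Test procedure using fixed-sequence testing. The sketch below is not the paper's algorithm: the function names are hypothetical, the Hoeffding p-value is a simple stand-in for whatever p-value the authors use, and the candidate thresholds are assumed to be ordered from most to least conservative. It assumes per-sample losses bounded in $[0, 1]$ and takes the empirical calibration risk of each candidate as given.

```python
import math


def hoeffding_p_value(emp_risk, n, alpha):
    """P-value for the null hypothesis 'true risk > alpha'.

    Valid for losses bounded in [0, 1], based on Hoeffding's inequality.
    (A simple choice used here for illustration only.)
    """
    return math.exp(-2.0 * n * max(0.0, alpha - emp_risk) ** 2)


def fixed_sequence_select(lambdas, emp_risks, n, alpha, delta):
    """Fixed-sequence testing over an ordered list of candidates.

    Candidates are tested in order; testing stops at the first null
    that is not rejected at level delta. Returns the last rejected
    candidate, or None if even the first one is not rejected.
    """
    selected = None
    for lam, risk in zip(lambdas, emp_risks):
        if hoeffding_p_value(risk, n, alpha) > delta:
            break
        selected = lam
    return selected


# Toy usage with made-up calibration risks (hypothetical numbers).
lambdas = [1.0, 0.9, 0.8, 0.7]        # most conservative first
emp_risks = [0.0, 0.02, 0.05, 0.09]   # empirical accuracy gap per candidate
print(fixed_sequence_select(lambdas, emp_risks, n=1000, alpha=0.1, delta=0.01))
# selects 0.8: the p-value at 0.7 is about 0.82, which exceeds delta
```

Because the candidates are tested in a fixed order and testing stops at the first non-rejection, the family-wise error rate stays at level $\delta$ without a multiplicity correction, which is what makes the approach practical for large candidate sets.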

Figures (5)

Figure 1: An illustration of a reading comprehension task. An LLM sequentially processes the given document to find the answer to the question provided and, ideally, should stop scanning the document immediately after the required information is found. The context is taken from https://en.wikipedia.org/wiki/Ronald_Fisher.
Figure 2: Comparison between the marginal and conditional methods for the reading comprehension task. Nominal accuracy gap level is $\alpha=10\%$ and $\delta=1\%$. Left: empirical conditional accuracy gap, $\hat{R}_{\text{gap}}^{\leq t}$, across 100 trials; each curve corresponds to a different random split of the calibration and test data. Right: accumulated halt times as a function of $t$, averaged over 100 random splits; the shaded area represents a 95% confidence interval.
Figure C.3: Comparison between the marginal and conditional methods for the structured datasets. The other details are as in Figure 2.
Figure C.4: Normalized halt time $T_{\text{avg}}$ vs. tolerable accuracy gap $\alpha$. The results are averaged over 100 random splits of the Tiselac dataset, with (tiny) standard error bars.
Figure D.5: The importance of the testing procedure (Stage 2). Comparison of conditional accuracy gap obtained by candidate screening (Stage 1, black curves) and by the full conditional method (Stage 1+2, orange curves). The results are presented for 100 random calibration/test splits of the QuALITY dataset, with each curve corresponding to a different split.

Theorems & Definitions (4)

Proposition 1
Proposition 2
Proof of Proposition 1
Proof of Proposition 2
