Incremental Sequence Classification with Temporal Consistency

Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, David Barber

TL;DR

The paper introduces a temporally-consistent, TD-inspired loss (TC-$\lambda$) for incremental sequence classification, enabling predictions at every prefix with a single model. It establishes theoretical convergence and consistency in a tabular setting and demonstrates data-efficiency advantages over direct cross-entropy, both in theory and in practice. Empirically, TC-$\lambda$ improves prefix and full-sequence accuracy on text classification and yields stronger early correctness signals for GSM8K verification, enabling compute-aware inference strategies. The approach is architecture-agnostic, shows promise for scalable verification and multi-token prediction, and points to meaningful practical benefits in real-time decision-making and LLM evaluation, while noting the need for broader evaluations on larger models and multimodal tasks.

Abstract

We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.
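The temporal-consistency condition mentioned above can be written out as a short worked equation. This is a sketch in our own notation (the symbols $x_{1:t}$ for the observed prefix and $y$ for the class label are assumptions, not taken from the paper): by the law of total expectation, the prediction for a prefix should equal the expected prediction after one more element is revealed.

```latex
% Temporal consistency as an instance of the tower property (notation is illustrative).
\begin{align*}
  p(y \mid x_{1:t})
    &= \sum_{x_{t+1}} p(x_{t+1} \mid x_{1:t}) \, p(y \mid x_{1:t+1})
     = \mathbb{E}_{x_{t+1} \mid x_{1:t}} \bigl[ p(y \mid x_{1:t+1}) \bigr]. \\
  \intertext{Toy example: two equally likely continuations $a$ and $b$, with
  $p(y \mid x_{1:t}, a) = 0.9$ and $p(y \mid x_{1:t}, b) = 0.3$:}
  p(y \mid x_{1:t}) &= 0.5 \cdot 0.9 + 0.5 \cdot 0.3 = 0.6 .
\end{align*}
```

A TD-style loss can then, roughly speaking, use the prediction at step $t+1$ as a target for the prediction at step $t$, rather than relying only on the final label.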

Paper Structure

This paper contains 33 sections, 4 theorems, 20 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

For any $M \times K$ row-stochastic matrix $\bm{P}_0$, the fixed-point iteration \eqref{eq:tabiter} converges to $\bm{P}^\star$.
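The iteration that Proposition 1 refers to is not reproduced on this page. As a rough illustration only, the sketch below assumes it has the standard form for an absorbing Markov chain, $\bm{P} \leftarrow \bm{Q}\bm{P} + \bm{R}$, where $\bm{Q}$ holds transitions among the $M$ transient states and $\bm{R}$ holds transitions into the $K$ absorbing classes. The names `Q`, `R`, and `absorption_fixed_point` are hypothetical, and the paper's actual iteration may differ (for example, through its dependence on $\lambda$).

```python
import numpy as np


def absorption_fixed_point(Q, R, P0, n_iters=1000, tol=1e-12):
    """Iterate P <- Q @ P + R until successive iterates stop changing.

    Assumed setup (not taken from the paper):
      Q:  (M, M) transition probabilities among transient states.
      R:  (M, K) transition probabilities into absorbing classes.
      P0: (M, K) row-stochastic initial guess.
    """
    P = P0.copy()
    for _ in range(n_iters):
        P_next = Q @ P + R
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P


# Toy chain: state 0 moves to state 1 or 2 with equal probability;
# states 1 and 2 are absorbed immediately into one of two classes.
Q = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
R = np.array([[0.0, 0.0],
              [0.9, 0.1],
              [0.3, 0.7]])
P0 = np.full((3, 2), 0.5)  # uniform, row-stochastic starting point

P_star = absorption_fixed_point(Q, R, P0)

# Closed-form absorption probabilities, (I - Q)^{-1} R, for comparison.
P_closed = np.linalg.solve(np.eye(3) - Q, R)
```

In this toy chain the iterate for state 0 settles at the average of the class probabilities of its two successors, which matches the closed-form solution regardless of the row-stochastic starting point chosen.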

Figures (8)

  • Figure 1: Left: Markov chain with $T$ layers of $W$ states each, and two absorbing states. Right: Mean-squared error of the direct (DCE) and indirect (TC) estimates for a state in the first layer as a function of $W$. We set $N = 20W$ and report the mean and 95% confidence intervals over 100 runs.
  • Figure 2: Predictive performance of OPT models with 125M, 350M, and 1.3B parameters, respectively, on the ohsumed dataset. We report the area under the ROC curve (mean and 95% confidence interval over 10 runs; higher is better).
  • Figure 3: Left: Accuracy of OPT-125M classifiers on ohsumed as a function of $\lambda$ (mean and 95% CI over 5 runs). Right: Average KL-divergence between successive predictive distributions (mean and 95% CI over 10 runs). Lower values correspond to predictive distributions that are more similar across successive time steps.
  • Figure 4: Predicted probability of correctness for two generations of Qwen2.5-0.5B for the prompt "David found \$12 on the street. He then gave it to his friend Evan who has \$1 and needed to buy a watch worth \$20. How much does Evan still need?"
  • Figure 5: Incremental verification for Qwen2.5-0.5B on GSM8K. Left: The TC-$\lambda$ verifier is better at distinguishing between correct and incorrect generations early on. Center & right: A better trade-off between accuracy and compute can be obtained by stopping unpromising generations early on.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3: Adapted from cheikhi2023statistical
  • Proof of Proposition 1
  • Proof of Proposition 2
  • Proof of Proposition 3
  • Proposition 4
  • proof