LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna

TL;DR

LYNX introduces an online, confidence-controlled early-exit mechanism for reasoning models by tying exit decisions to natural thinking cues. It trains a lightweight probe on hidden states at cue tokens, supervises it with labels from forced exits, and uses split conformal prediction to calibrate a threshold that bounds the rate of premature exits at a user-specified level, all without changing decoding or relying on external verifiers. The approach generalizes across model families and tasks, achieving strong accuracy–efficiency tradeoffs on multiple math benchmarks and a non-math task, with token savings often exceeding 40% while preserving accuracy. This yields a deployment-ready method with explicit, distribution-free confidence guarantees and competitive Pareto frontiers against state-of-the-art early-exit methods.

Abstract

Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy–efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40–65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35–60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
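
To make the calibration step concrete, the following is a minimal sketch of generic split conformal thresholding, not the paper's exact procedure. It assumes the probe outputs a scalar score where higher means "safer to exit", and that a held-out calibration set provides scores at cues where a forced exit gave a wrong answer. The names `calibrate_exit_threshold`, `should_exit`, and `unsafe_scores` are hypothetical.

```python
import math


def calibrate_exit_threshold(unsafe_scores, delta):
    """Split conformal threshold from held-out probe scores.

    unsafe_scores: probe scores at calibration cues where a forced exit
                   produced a wrong answer (assumed layout, hypothetical).
    delta:         target upper bound on the premature-exit rate.
    """
    n = len(unsafe_scores)
    if n == 0:
        return math.inf  # no calibration evidence: never exit early
    # Conformal rank: the ceil((n + 1) * (1 - delta))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - delta))
    if k > n:
        return math.inf  # too few calibration points for this delta
    return sorted(unsafe_scores)[k - 1]


def should_exit(score, threshold):
    """Exit only when the probe score strictly exceeds the threshold."""
    return score > threshold
```

Under the usual exchangeability assumption between calibration and test cues, a new unsafe cue scores above this threshold with probability at most `delta`. Lowering `delta` raises the threshold and makes exits rarer, which is the sense in which a single confidence parameter trades accuracy for efficiency.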

Paper Structure

This paper contains 49 sections, 12 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Accuracy–efficiency tradeoffs under LYNX: a single confidence parameter smoothly moves the model from baseline-level accuracy to substantially more efficient generations (1.5–3.3$\times$).
  • Figure 2: LYNX pipeline: (Offline) We collect forced-exit labels at naturally occurring cue tokens from mathematical problems, train a lightweight probe on hidden states at these cues, and calibrate conformal thresholds using a held-out set. (Online) During generation, whenever a cue appears, the probe scores its hidden state and a conformalized threshold decides whether to exit early or continue reasoning.
  • Figure 3: Accuracy–efficiency Pareto frontiers for LYNX compared to DEER and Think-or-Not (ToN). Each panel plots speed-up (baseline tokens divided by method tokens) on the $x$-axis and change in accuracy vs. baseline (percentage points) on the $y$-axis. Across all settings, LYNX forms competitive or superior frontiers, with DEER slightly ahead only on QwQ-32B GSM8K.
  • Figure 4: Example outputs comparing baseline generation with LYNX early exit on GSM8K. The baseline model generates 1,105 tokens with extensive overthinking, while LYNX monitors natural reasoning cues (hmm, wait) and exits confidently after 258 tokens when the conformal predictive set contains only the correct answer. Both arrive at the same correct answer, but LYNX achieves 76.6% token reduction. The exit is triggered only when the conformal predictive set collapses to $\{1\}$, i.e., when the probe is calibrated to treat the cue as a safe exit at the chosen confidence level.
  • Figure 5: Accuracy–efficiency tradeoffs at temperature $T = 0.0$ for Llama-3.1-Nemotron-Nano-8B-v1. Each panel shows baseline chain-of-thought decoding and LYNX at multiple confidence levels $c = 1 - \delta$. Bars report accuracy, and the overlaid line reports efficiency gain (baseline tokens divided by method tokens). LYNX achieves large token savings on all datasets while substantially improving or preserving accuracy relative to the Nemotron baseline.
  • ...and 3 more figures
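
The efficiency metrics used in the captions above can be computed directly from token counts. The sketch below applies the speed-up definition from the Figure 3 caption (baseline tokens divided by method tokens) to the token counts quoted in the Figure 4 caption. The function names are hypothetical, and the token-reduction formula is the standard one, assumed here rather than taken from the paper.

```python
def speedup(baseline_tokens, method_tokens):
    """Speed-up as defined in the Figure 3 caption."""
    return baseline_tokens / method_tokens


def token_reduction(baseline_tokens, method_tokens):
    """Fraction of baseline tokens saved (assumed definition)."""
    return 1.0 - method_tokens / baseline_tokens


# Token counts quoted in the Figure 4 caption.
baseline, lynx = 1105, 258
print(f"speed-up: {speedup(baseline, lynx):.2f}x")
print(f"token reduction: {token_reduction(baseline, lynx):.1%}")
```

For these counts the speed-up is a little over 4$\times$, and the reduction agrees with the caption's figure up to rounding.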