Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Jinyan Su, Claire Cardie

TL;DR

The paper tackles the problem of verbose reasoning in RL-trained LLMs by introducing Adaptive Direct Length Penalty (A-DLP), a reward shaping technique that dynamically adjusts the length penalty based on current accuracy to speed up length reduction while preserving correctness. A-DLP contrasts with Static Direct Length Penalty (S-DLP) by updating the penalty coefficient in response to the accuracy gap against a reference, enabling aggressive early compression and gradual relaxation as performance evolves. Empirical results on math benchmarks show A-DLP consistently reduces token length by over 50% with minimal accuracy loss, outperforming fixed-penalty baselines and showing robust training behavior without collapses. The method is lightweight, integrates into existing RL pipelines, and has practical implications for reducing inference costs in large-scale LLM reasoning systems.

Abstract

Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces -- even for simple queries -- leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model's reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" -- producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.
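
The adaptive mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the update form `lam + eta * (acc_t - acc_ref)` clipped at zero and the reward form `correctness - lam * length` are assumptions inferred from the abstract and the figure captions (which mention the gap $\text{acc}_t - \text{acc}_{\text{ref}}$, the coefficient $\lambda_t$, and a learning rate $\eta$). The names `AdaptivePenalty` and `shaped_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptivePenalty:
    """Hypothetical holder for the adaptive length-penalty state."""

    lam: float      # current penalty coefficient (lambda_t)
    eta: float      # learning rate for the coefficient update
    acc_ref: float  # reference accuracy the model should stay near

    def update(self, acc_t: float) -> float:
        # Assumed rule: raise the penalty when accuracy is above the
        # reference, relax it when accuracy falls below, never go negative.
        self.lam = max(0.0, self.lam + self.eta * (acc_t - self.acc_ref))
        return self.lam


def shaped_reward(correct: bool, num_tokens: int, lam: float) -> float:
    # Assumed reward shape: correctness reward minus a length penalty
    # that is linear in the number of generated tokens.
    return (1.0 if correct else 0.0) - lam * num_tokens


penalty = AdaptivePenalty(lam=1e-3, eta=1e-3, acc_ref=0.6)
penalty.update(acc_t=0.7)  # accuracy above reference: penalty grows
penalty.update(acc_t=0.2)  # accuracy below reference: penalty shrinks
reward = shaped_reward(correct=True, num_tokens=500, lam=penalty.lam)
```

Under these assumptions, a static penalty (S-DLP) corresponds to never calling `update`, so `lam` stays fixed for the whole run.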

Paper Structure

This paper contains 23 sections, 4 equations, 15 figures.

Figures (15)

  • Figure 1: Performance comparison of A-DLP with baseline methods. For S-DLP, we plot checkpoints sampled every 20 training steps and fit a curve through them. Since both accuracy and generation length change monotonically during training under S-DLP, this trajectory captures the full accuracy–length trade-off. A-DLP consistently achieves better trade-offs, lying above and to the left of the S-DLP curve.
  • Figure 2: Accuracy and average token length across training steps for A-DLP and S-DLP. The dotted lines mark the accuracy and token length of the base model before length reduction. For S-DLP, performance remains stable during the early training phase, but both accuracy and token length drop sharply around step 100, indicating model collapse due to excessive length penalization. In contrast, A-DLP converges stably, with both metrics gradually leveling off, which demonstrates its ability to adaptively balance correctness and brevity throughout training.
  • Figure 3: Token length of correct and incorrect responses before and after applying A-DLP. The reduction rates for both categories consistently exceed 55%.
  • Figure 4: Training dynamics of A-DLP showing the accuracy gap between the current model and the reference threshold ($\text{acc}_t - \text{acc}_{\text{ref}}$), the length penalty coefficient $\lambda_t$, validation accuracy on AIME2024, and the average response length (number of tokens) on the training data.
  • Figure 5: Training dynamics of $\lambda_t$ and response length under different learning rates ($\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$). A larger learning rate causes $\lambda_t$ to fluctuate more sharply because it is more sensitive to noisy accuracy estimates; this slows length reduction, although training eventually converges given enough steps. In contrast, a smaller learning rate leads to smoother updates and faster token reduction in the early training stage, but risks model collapse in later stages: $\lambda_t$ fails to decrease quickly enough when accuracy drops, so the model continues to be over-penalized and its responses are shortened excessively.
  • ...and 10 more figures
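
The learning-rate sensitivity described in the Figure 5 caption can be illustrated with a toy simulation. This sketch uses synthetic zero-mean Gaussian noise as a stand-in for a noisy accuracy gap; it is not data from the paper, and the update rule (additive step scaled by $\eta$, clipped at zero) is an assumption consistent with the captions rather than a confirmed formula. The helper names `simulate_lambda` and `swing` are hypothetical. It shows only that the same noise moves $\lambda_t$ further when $\eta$ is larger; it does not reproduce the length or collapse behavior.

```python
import random


def simulate_lambda(eta, steps=200, noise_std=0.05, seed=0):
    """Track lambda_t when the accuracy gap is pure synthetic noise."""
    rng = random.Random(seed)  # same seed -> same noise for every eta
    lam = 0.0                  # toy starting value, chosen for simplicity
    trajectory = []
    for _ in range(steps):
        gap = rng.gauss(0.0, noise_std)    # synthetic acc_t - acc_ref
        lam = max(0.0, lam + eta * gap)    # assumed clipped additive update
        trajectory.append(lam)
    return trajectory


def swing(trajectory):
    """Peak-to-peak range of lambda over the run."""
    return max(trajectory) - min(trajectory)


# Compare the three learning rates named in the Figure 5 caption.
swings = {eta: swing(simulate_lambda(eta)) for eta in (1e-2, 1e-3, 1e-4)}
for eta, value in swings.items():
    print(f"eta={eta:g}  lambda swing={value:.3e}")
```

Because every run sees the same noise sequence and starts from zero, the swing scales roughly in proportion to `eta`, which matches the caption's point that a larger learning rate makes $\lambda_t$ react more sharply to noisy accuracy estimates.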