LACONIC: Length-Aware Constrained Reinforcement Learning for LLM

Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, Lin F. Yang

TL;DR

LACONIC addresses the problem of verbose outputs in RL-tuned LLMs by enforcing an average token budget $B$ through a clipped length cost weighted by an adaptively updated dual variable $\lambda$. It formulates length control as a constrained RL problem and derives a clipped-cost primal–dual optimization that stabilizes policy updates while enforcing the budget. Theoretical results guarantee convergence to a feasible policy and bound the reward gap to the constrained optimum, providing near-optimality under reasonable assumptions. Empirically, LACONIC significantly reduces output length (up to ~71% in some cases) while preserving or improving pass@1 on math benchmarks, and it generalizes to out-of-domain tasks with substantial token reductions and minimal deployment overhead.

Abstract

Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can introduce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide theoretical guarantees that support the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.

Paper Structure

This paper contains 30 sections, 6 theorems, 58 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\pi^\star\in\arg\max_{\pi:\widetilde{C}(\pi)\le 0} R(\pi)$ be an optimal feasible policy of the length-constrained problem in (eq:CMDP). Let $(\pi^\sharp,\lambda^\sharp)$ be the feasible limit of the idealized clipped-cost primal-dual updates in (eq:ideal_dyn_primal) and (eq:ideal_dyn_dual). Then […]. Moreover, for indicator rewards with the $\lambda$-ceiling $\Lambda = \frac{B}{L_{\max} - B}$ and a maximum len

Figures (7)

  • Figure 1: The top panel sketches RL-tuning with a fixed length-aware shaping objective (blue). As the heuristically shaped objective generally differs from the true task reward $R$ (red), optimizing it may converge to a policy $\pi_N$ that is suboptimal in $R$. The bottom panel sketches training with LACONIC. LACONIC adaptively updates the length-aware objective (green) so that it better aligns with the true task reward while achieving shorter outputs, yielding near-optimal policies.
  • Figure 2: Illustration of LACONIC. LACONIC alternates two steps: (1) in a primal update, the policy model is updated on an augmented objective that trades off task reward $r$ with a length-aware cost $c$ scaled by the dual variable $\lambda$; (2) in a dual update, $\lambda$ is adaptively updated to enforce a token budget constraint $B$ by increasing when the average length $\bar{L}$ exceeds the budget $B$ and decreasing otherwise. Together, these updates maximize task reward while meeting the budget on average.
  • Figure 3: Ablation of the cost functions on DeepScaleR-1.5B with token budget $B=1500$. We plot (a) accuracy reward and (b) average response length over training steps. For the linear cost, the Lagrangian reward used in primal updates is computed as $r(q,o) - \lambda\,\widetilde{c}(q,o)$, where $\widetilde{c}(q,o) = (L(o)-B)/B$. All other experiment setups and hyperparameters are identical.
  • Figure 4: Ablation of the token budget $B$ on DeepScaleR-1.5B. We plot (a) accuracy reward; (b) average response length; and (c) dual variable $\lambda$ over training steps with budgets $B\in\{1000,1500,1750,2000\}$. All other setups and hyperparameters are identical.
  • Figure 5: Average computational resource usage of LACONIC (green) and GRPO (blue).
  • ...and 2 more figures
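
The alternating updates described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. Only the linear cost $(L(o)-B)/B$, the Lagrangian reward $r - \lambda c$, and the direction of the dual update come from the text above. The clipping range $[-1, 1]$, the step size, the normalized dual step, the value of $L_{\max}$, and all function names are assumptions made for illustration and may differ from the paper's actual algorithm.

```python
def linear_cost(length: int, budget: int) -> float:
    """Normalized linear length cost (L(o) - B) / B, as in the Figure 3 caption."""
    return (length - budget) / budget


def clipped_cost(length: int, budget: int) -> float:
    """ASSUMPTION: clip the linear cost to [-1, 1]; the paper's clipping may differ."""
    return max(-1.0, min(1.0, linear_cost(length, budget)))


def lagrangian_reward(reward: float, length: int, budget: int, lam: float) -> float:
    """Reward used in the primal (policy) update: r - lambda * c."""
    return reward - lam * clipped_cost(length, budget)


def dual_update(lam: float, lengths: list[int], budget: int,
                step_size: float, lam_max: float) -> float:
    """Raise lambda when the batch's average length exceeds the budget and
    lower it otherwise, then project onto [0, lam_max].

    ASSUMPTION: a projected step on the normalized violation (L_bar - B) / B.
    """
    avg_len = sum(lengths) / len(lengths)
    proposed = lam + step_size * (avg_len - budget) / budget
    return max(0.0, min(lam_max, proposed))


if __name__ == "__main__":
    B = 1500        # token budget used in the Figure 3 ablation
    L_MAX = 4000    # hypothetical maximum generation length
    LAM_MAX = B / (L_MAX - B)  # lambda-ceiling B / (L_max - B) from Theorem 3.1

    lam = 0.1
    batch_lengths = [2000, 2500]   # made-up response lengths
    batch_rewards = [1.0, 0.0]     # made-up indicator rewards

    shaped = [lagrangian_reward(r, n, B, lam)
              for r, n in zip(batch_rewards, batch_lengths)]
    lam = dual_update(lam, batch_lengths, B, step_size=0.1, lam_max=LAM_MAX)
    print(shaped, lam)
```

In this sketch the shaped rewards would feed whatever policy-gradient step is used for the primal update, and the dual update runs once per batch, so $\lambda$ grows only while responses run over budget on average.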

Theorems & Definitions (11)

  • Theorem 3.1: Price of clipped cost
  • Lemma 3.1
  • proof
  • Theorem 3.2: Restatement of Theorem 3.1
  • proof
  • Lemma 3.3: Fixed-point feasibility
  • proof
  • Theorem 3.4: Convergence
  • proof
  • Theorem 3.5: Convergence rate
  • ...and 1 more