LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, Lin F. Yang
TL;DR
LACONIC addresses the problem of verbose outputs in RL-tuned LLMs by enforcing an average token budget $B$ through a clipped length cost whose weight is an adaptively updated dual variable $\lambda$. It formulates length control as a constrained RL problem and derives a clipped-cost primal–dual optimization that stabilizes policy updates while enforcing the budget. Theoretical results guarantee convergence to a feasible policy and bound the reward gap to the constrained optimum, providing near-optimality under reasonable assumptions. Empirically, LACONIC significantly reduces output length (up to ~71% in some cases) while preserving or improving pass@1 on math benchmarks, and it generalizes to out-of-domain tasks with substantial token reductions and minimal deployment overhead.
Abstract
Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can produce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide theoretical guarantees that support the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.
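The augmented objective and adaptive cost scale described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cost form (normalized overshoot above the budget, clipped to a ceiling), the projected dual-ascent rule, and all names (`clipped_length_cost`, `augmented_reward`, `dual_update`, `clip`, `step_size`) are assumptions chosen for illustration only.

```python
def clipped_length_cost(length, budget, clip=1.0):
    """Assumed cost form: relative overshoot above the budget, clipped to [0, clip]."""
    overshoot = (length - budget) / budget
    return min(max(overshoot, 0.0), clip)


def augmented_reward(task_reward, length, budget, lam, clip=1.0):
    """Task reward minus the length cost scaled by the dual variable lam."""
    return task_reward - lam * clipped_length_cost(length, budget, clip)


def dual_update(lam, mean_length, budget, step_size=0.1):
    """Assumed projected dual ascent: lam grows when the batch's average
    length exceeds the budget, shrinks otherwise, and never goes below zero."""
    violation = (mean_length - budget) / budget
    return max(0.0, lam + step_size * violation)


if __name__ == "__main__":
    budget = 1000  # hypothetical token budget B
    lam = 0.0
    # Hypothetical (task_reward, response_length) pairs for one batch.
    batch = [(1.0, 1800), (0.0, 2400), (1.0, 900)]

    shaped = [augmented_reward(r, n, budget, lam) for r, n in batch]
    mean_length = sum(n for _, n in batch) / len(batch)
    lam = dual_update(lam, mean_length, budget)
    print(shaped, lam)
```

In this sketch the penalty is zero for responses within the budget, so short correct answers keep their full task reward, and the clip bounds how far a single very long response can pull the shaped reward. The dual variable only changes between batches, which is one simple way to realize a cost scale that is "adaptively adjusted throughout training".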
