BARD: budget-aware reasoning distillation

Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo Zheng

TL;DR

BARD addresses the challenge of transferring reasoning capabilities to smaller models while enabling fine-grained control over reasoning length, and therefore over computational cost. It introduces a two-phase training regimen: budget-aware supervised fine-tuning with contrastive data, followed by reinforcement learning with a multiplicative reward that jointly optimizes accuracy and budget fidelity. Experiments on AIME24, AIME25, and GPQA show that an 8B student can achieve strong reasoning performance while flexibly adjusting CoT length across budgets, outperforming naive truncation and budget-agnostic distillation. The student adapts its reasoning strategy to the budget it is given, which supports resource-aware deployment of reasoning-intensive models.

Abstract

While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the resulting reasoning is often redundant and its computational budget cannot be controlled, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a framework that simultaneously distills reasoning capability and enables fine-grained control over reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this, BARD introduces a two-phase training regimen. The first phase is Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, which bootstraps the model's understanding of budget constraints. The second phase applies Reinforcement Learning (RL) with a reward signal that accounts for reasoning performance and budget fidelity simultaneously. Combining the two phases is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method enables an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
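
The idea of a reward that multiplies accuracy by budget fidelity can be illustrated with a minimal sketch. The exact reward used by BARD is not given on this page, so everything below is an assumption made for illustration: the function names `budget_fidelity` and `multiplicative_reward`, the linear fidelity shape, and the `tolerance` parameter are all hypothetical.

```python
def budget_fidelity(length: int, budget: int, tolerance: float = 0.5) -> float:
    """Hypothetical fidelity score in [0, 1].

    Returns 1.0 when the reasoning length equals the budget and falls
    linearly to 0.0 once the relative deviation reaches `tolerance`.
    This shape is an assumption, not the formula from the paper.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    deviation = abs(length - budget) / budget
    return max(0.0, 1.0 - deviation / tolerance)


def multiplicative_reward(is_correct: bool, length: int, budget: int) -> float:
    """Product of a 0/1 accuracy term and the assumed fidelity term."""
    accuracy = 1.0 if is_correct else 0.0
    return accuracy * budget_fidelity(length, budget)


if __name__ == "__main__":
    # Correct answer, length exactly on budget.
    print(multiplicative_reward(True, 1000, 1000))
    # Correct answer, 25% over budget.
    print(multiplicative_reward(True, 1250, 1000))
    # Wrong answer, length exactly on budget.
    print(multiplicative_reward(False, 1000, 1000))
```

With a product, either factor going to zero zeroes the reward, so under these assumptions the policy cannot compensate for a wrong answer by matching the budget, or for ignoring the budget by answering correctly. An additive combination would allow that trade-off.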

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An overview of the Budget-Aware Reasoning Distillation (BARD) framework. It has two main phases: (1) an SFT phase where the model learns to understand budget constraints by training on CoTs expertly compressed to different budget levels, and (2) an RL phase where the model's policy is refined using a reward signal that combines accuracy and budget fidelity.
  • Figure 2: Budget Fidelity and Accuracy Scaling: The top row plots the generated reasoning length against the specified budget for BARD, where n/a means no budget constraint. The bottom row plots accuracy as a function of the thinking budget. We compare BARD's scaling performance against the post-hoc truncation baseline (Budget Forcing). The performance of the unconstrained SFT-Full and BARD w/o budget models is shown as horizontal dashed lines, representing the practical performance ceilings.
  • Figure 3: The Indispensable Role of Budget-Aware SFT. This figure demonstrates the outcome of a model trained with Reinforcement Learning directly, omitting our proposed budget-aware SFT phase. The plot shows the generated reasoning length versus the specified budget. This result confirms that our budget-aware SFT phase is a critical bootstrapping step for instilling an understanding of budget constraints in the model.
  • Figure 4: The left plot shows the distribution of thinking budgets. The middle and right plots compare model outputs without and with contrastive data, respectively. The model trained with contrastive data exhibits stronger adherence to the target reasoning length across a wider range, though performance in the long-tail regions remains limited.
  • Figure 5: RL refinement also leads to a consistent improvement in task accuracy, indicating the model learns a more optimal reasoning policy beyond simple pattern imitation.
  • ...and 2 more figures