Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
TL;DR
The paper addresses the problem that agent-tuning data contain tokens with distinct roles, notably reasoning tokens and boilerplate tokens, which are learned at different rates. It introduces SHAD, a shuffle-aware discriminator that classifies tokens by comparing token-level losses after shuffling input-output pairs: it computes the loss difference $LD(y_k)=l_s(y_k)-l_o(y_k)$ and labels a token as boilerplate when $LD(y_k)\le 0$ and as reasoning otherwise. Building on SHAD, it proposes Reasoning-highlighted Fine-Tuning (RFT), which prioritizes reasoning tokens through the adaptively weighted loss $\mathcal{L}_{RFT}= \omega_b \mathcal{L}_b + \omega_r \mathcal{L}_r$, where the weights $\omega_b$ and $\omega_r$ are softmax-normalized with a temperature $\tau$. Empirical results on ToolBench, APIGen, and ShareGPT-based data show that SHAD+RFT improves agent capabilities on both held-in and held-out benchmarks, illustrating the value of token-level discrimination and adaptive weighting for fine-tuning large language models. The work offers a practical way to mitigate overfitting to boilerplate patterns and to strengthen multi-step reasoning and tool use in real-world tasks.
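The two formulas above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names `classify_tokens` and `rft_loss` are hypothetical, the per-token losses are assumed to be precomputed, and the softmax is assumed to be taken over the two group mean losses $\mathcal{L}_b$ and $\mathcal{L}_r$ divided by $\tau$, which the summary does not spell out.

```python
import math

def classify_tokens(loss_shuffled, loss_original):
    """Label each token from LD(y_k) = l_s(y_k) - l_o(y_k).

    LD <= 0 -> "boilerplate", otherwise "reasoning".
    """
    labels = []
    for l_s, l_o in zip(loss_shuffled, loss_original):
        ld = l_s - l_o
        labels.append("boilerplate" if ld <= 0 else "reasoning")
    return labels

def rft_loss(token_losses, labels, tau=1.0):
    """Return (L_RFT, w_b, w_r) with L_RFT = w_b * L_b + w_r * L_r.

    Assumption: w_b and w_r are a softmax over (L_b / tau, L_r / tau),
    where L_b and L_r are the mean losses of the two token groups.
    """
    b = [l for l, t in zip(token_losses, labels) if t == "boilerplate"]
    r = [l for l, t in zip(token_losses, labels) if t == "reasoning"]
    if not b or not r:
        # Only one group present: fall back to a plain mean (sketch-only choice).
        group = b or r
        mean = sum(group) / len(group)
        return mean, float(bool(b)), float(bool(r))
    mean_b = sum(b) / len(b)
    mean_r = sum(r) / len(r)
    # Subtract the max before exponentiating for numerical stability.
    m = max(mean_b, mean_r)
    e_b = math.exp((mean_b - m) / tau)
    e_r = math.exp((mean_r - m) / tau)
    w_b = e_b / (e_b + e_r)
    w_r = e_r / (e_b + e_r)
    return w_b * mean_b + w_r * mean_r, w_b, w_r

# Toy example: three tokens with made-up losses.
labels = classify_tokens([0.1, 2.0, 0.5], [0.3, 1.0, 0.5])
loss, w_b, w_r = rft_loss([0.2, 1.5, 0.4], labels, tau=1.0)
```

Under this assumption, the group with the higher mean loss receives the larger weight, and a smaller $\tau$ makes that preference sharper.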
Abstract
When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles, specifically reasoning tokens versus boilerplate tokens (e.g., those governing output format), differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).
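The shuffling step can be pictured with the following Python sketch. It is illustrative only: the helper name `shuffle_pairs` and the toy samples are hypothetical, and the choice to guarantee that no input keeps its own output (by shifting outputs along a random cycle) is an assumption of this sketch, not a detail stated in the abstract.

```python
import random

def shuffle_pairs(samples, seed=0):
    """Re-pair each input with the output of a different sample.

    samples: list of (input_text, output_text) tuples.
    Assumption: outputs are shifted by one position along a random cycle,
    so no input keeps its original output when len(samples) >= 2.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to shuffle")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    shuffled = [None] * n
    for i, idx in enumerate(order):
        donor = order[(i + 1) % n]  # next sample in the random cycle
        shuffled[idx] = (samples[idx][0], samples[donor][1])
    return shuffled

# Toy data: inputs stay in place, outputs come from other samples.
samples = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
mismatched = shuffle_pairs(samples, seed=0)
```

After such a re-pairing, an output no longer follows from its input, so only tokens that recur across samples (such as format markers) should stay predictable, which is the signal SHAD relies on.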
