Reasoning with Reinforced Functional Token Tuning

Kongcheng Zhang; Qi Yao; Baisheng Lai; Jiaxing Huang; Wenkai Fang; Dacheng Tao; Mingli Song; Shunyu Liu

TL;DR

Reinforced Functional Token Tuning (RFTT) introduces learnable functional tokens embedded in LLM vocabularies to enable self-play reasoning. It pairs a supervised fine-tuning warmup using functional prompts with an online reinforcement learning phase that samples functional tokens to autonomously expand reasoning trees, guided by a KL-penalized objective and optional Process/Outcome Reward Models. Empirical results on mathematical benchmarks show substantial improvements for small models on MATH (Qwen-2.5-7B-Instruct from 70.6% to 79.8% and LLaMA-3.1-8B-Instruct from 32.2% to 60.2%), with performance continuing to rise as more rollouts are performed at inference. The approach improves consistently over baselines and highlights the potential of token-guided learning to internalize reasoning patterns in resource-constrained LLMs. Code is available for replication.

Abstract

In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., <analyze>, <verify>, <refine>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
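The two ingredients named above, functional tokens added to the vocabulary and a KL-penalized reinforcement learning signal (mentioned in the TL;DR), can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the helper names `extend_vocab` and `kl_penalized_reward`, the coefficient `beta`, and the specific reward form (task reward minus `beta` times a sampled log-ratio estimate of the KL divergence, a common choice in RL fine-tuning) are assumptions made for illustration only.

```python
# Functional tokens quoted in the abstract; the paper's full set may be larger.
FUNCTIONAL_TOKENS = ["<analyze>", "<verify>", "<refine>"]


def extend_vocab(vocab, new_tokens):
    """Return a copy of `vocab` with `new_tokens` appended under fresh ids.

    Hypothetical helper: assumes existing ids are contiguous from 0, so the
    next free id equals the current vocabulary size.
    """
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Task reward minus a KL penalty against a reference model.

    Assumed form (not confirmed by the source): the KL term is estimated as
    the sum over sampled tokens of log pi(token) - log pi_ref(token).
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_estimate


if __name__ == "__main__":
    vocab = extend_vocab({"a": 0, "b": 1}, FUNCTIONAL_TOKENS)
    print(vocab)
    print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

In a real model, adding tokens also requires new embedding rows for them; the sketch covers only the id bookkeeping and the shape of the reward.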
