Reasoning with Reinforced Functional Token Tuning

Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, Shunyu Liu

TL;DR

Reasoning with Reinforced Functional Token Tuning (RFTT) introduces learnable functional tokens embedded in LLM vocabularies to enable self-play reasoning. It pairs a supervised fine-tuning warmup using functional prompts with an online reinforcement learning phase that samples functional tokens to autonomously expand reasoning trees, guided by a KL-penalized objective and optional Process/Outcome Reward Models. Empirical results on mathematical benchmarks show substantial improvements for small models (e.g., Qwen-2.5-7B-Instruct from 70.6% to 79.8% and LLaMA-3.1-8B-Instruct from 32.2% to 60.2% on MATH), with performance continuing to rise as more rollouts are performed at inference. The approach demonstrates robust improvements over baselines and highlights the potential of token-guided learning to internalize reasoning patterns in resource-constrained LLMs, with code available for replication.
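
For orientation, a KL-penalized reinforcement learning objective is commonly written in the generic form below. This is a sketch of the standard formulation, not necessarily the exact objective used in the paper: the symbols $\pi_\theta$ (the policy being trained), $\pi_{\mathrm{ref}}$ (a frozen reference policy, e.g. the model after the SFT warmup), $r$ (a reward from an outcome or process reward model), $\beta$ (the penalty weight), and $\mathcal{D}$ (the set of training problems) are all assumptions introduced here for illustration.

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
$$

The first term rewards reasoning paths that score well, while the KL term discourages the policy from drifting far from the reference model.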

Abstract

In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., <analyze>, <verify>, <refine>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
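
The idea of adding functional tokens to the vocabulary and sampling them to steer reasoning can be illustrated with a toy sketch. It is not the authors' implementation: the base vocabulary, the logits, and the helper names (`extend_vocab`, `sample_functional_token`, `expand_path`) are hypothetical, and a real system would use the model's own tokenizer and next-token distribution and would generate reasoning text after each sampled token. Only the three token names come from the abstract.

```python
import math
import random

# Functional tokens named in the abstract; everything else here is a toy stand-in.
FUNCTIONAL_TOKENS = ["<analyze>", "<verify>", "<refine>"]


def extend_vocab(vocab, new_tokens):
    """Return a copy of `vocab` with each unseen token given the next free id."""
    extended = dict(vocab)
    next_id = max(extended.values(), default=-1) + 1
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = next_id
            next_id += 1
    return extended


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_functional_token(logits_by_token, rng):
    """Sample one functional token from a softmax over (hypothetical) policy logits."""
    tokens = list(logits_by_token)
    probs = softmax([logits_by_token[t] for t in tokens])
    return rng.choices(tokens, weights=probs, k=1)[0]


def expand_path(max_steps, logits_by_token, rng):
    """Build one reasoning path as a sequence of sampled functional tokens.

    Assumption: a real policy would emit reasoning text after each token and
    condition the next choice on it; here the logits are fixed for simplicity.
    """
    return [sample_functional_token(logits_by_token, rng) for _ in range(max_steps)]


if __name__ == "__main__":
    vocab = extend_vocab({"hello": 0, "world": 1}, FUNCTIONAL_TOKENS)
    print(vocab)
    rng = random.Random(0)
    toy_logits = {"<analyze>": 1.0, "<verify>": 0.5, "<refine>": 0.0}
    print(expand_path(4, toy_logits, rng))
```

Sampling several such paths from the same starting point yields branches of a reasoning tree, which is the kind of structure the reinforcement learning phase explores.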

Paper Structure

This paper contains 22 sections, 41 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (Left) A conceptual illustration of reasoning path generation based on functional tree search. (Right) An overview of the RFTT framework, which comprises two phases: supervised fine-tuning warms up the model with self-generated data annotated with functional tokens, while online reinforcement learning allows the model to perform autonomous reasoning path exploration for self-improvement.
  • Figure 2: An illustrative diagram of the online reinforcement learning phase in RFTT. The LLM policy directly samples functional tokens from its vocabulary to autonomously expand reasoning trees to search for the final solutions. Then we use online reinforcement learning with process rewards to optimize the functional reasoning capabilities of the LLM policy.
  • Figure 3: Performance gains under scaling up the inference-time computation on the MATH and AMC benchmarks.
  • Figure 4: Training curves of RFTT with and without a PRM on the training dataset during RL.
  • Figure 5: Detailed statistics of the training dataset in the two-phase training.
  • ...and 1 more figure