The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

Ke Ji, Jiahao Xu, Tian Liang, Qiuzhi Liu, Zhiwei He, Xingyu Chen, Xiaoyuan Liu, Zhijie Wang, Junying Chen, Benyou Wang, Zhaopeng Tu, Haitao Mi, Dong Yu

TL;DR

UPFT introduces Prefix Self-Consistency as a key signal for improving LLM reasoning without labeled data or heavy sampling. By training primarily on minimal initial prefixes (as few as 8 tokens), plus a small amount of full-trace data to preserve reasoning structure, UPFT performs competitively with supervised methods while sharply reducing training and sampling costs. The approach is grounded in a Bayesian perspective that separates prefix coverage from prefix accuracy, and it is validated across multiple backbones and reasoning benchmarks, with notable gains on complex tasks. The method offers a scalable, resource-efficient way to improve reasoning in diverse LLM architectures and can be combined with label verification when labels are available.

Abstract

Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages the observation of Prefix Self-Consistency -- the shared initial reasoning steps across diverse solution trajectories -- to enhance LLM reasoning efficiency. By training exclusively on the initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model's structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.
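
As a rough illustration of the data construction described above, the following Python sketch truncates each sampled reasoning trace to its first few tokens and, with a small probability, keeps the full trace instead. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `full_trace_ratio` parameter, its default value, and the use of pre-tokenized lists are all hypothetical choices made for this example.

```python
import random

def truncate_to_prefix(tokens, prefix_len=8):
    """Keep only the first `prefix_len` tokens of a sampled reasoning trace."""
    return tokens[:prefix_len]

def build_training_set(traces, prefix_len=8, full_trace_ratio=0.1, seed=0):
    """Build unsupervised training targets from (question, tokens) pairs.

    Assumption: one sampled trace per question, no labels, no filtering.
    With probability `full_trace_ratio` the whole trace is kept (to preserve
    reasoning structure); otherwise only the prefix is kept.
    """
    rng = random.Random(seed)
    examples = []
    for question, tokens in traces:
        keep_full = rng.random() < full_trace_ratio
        target = tokens if keep_full else truncate_to_prefix(tokens, prefix_len)
        examples.append(
            {"question": question, "target": target, "is_full": keep_full}
        )
    return examples

# Toy usage with whitespace "tokens"; a real pipeline would use the model's tokenizer.
toy_traces = [
    ("What is 2 + 3?", "To solve this we first add the two numbers and get 5".split()),
    ("What is 4 * 5?", "We start by multiplying four by five which gives 20".split()),
]
prefix_only = build_training_set(toy_traces, prefix_len=8, full_trace_ratio=0.0)
```

Because the targets come from a single unfiltered sample per question, no answer labels are consulted at any point; the only knobs in this sketch are the prefix length and the share of full traces.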

Paper Structure

This paper contains 34 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) The conventional Rejection Sampling Fine-Tuning (RFT) method (upper panel) generates multiple responses to a given question and then applies posterior filtering to discard trajectories that lead to incorrect answers; the correct trajectory is then used for training. In contrast, the proposed UPFT method (bottom panel) requires only the prefix (the first few tokens) of a single generated sample, eliminating the need for labeled data or rejection sampling. (b) UPFT matches the performance of supervised RFT while reducing tuning cost by more than 75%.
  • Figure 2: An empirical investigation of prefix self-consistency. We investigate (a) the average number of trajectories covered by prefixes of different lengths, and (b) the success rate of 32 rollouts sampled from prefixes of both correct and incorrect trajectories.
  • Figure 3: The task template used to learn from the prefix of the reasoning traces. [question] represents the question that needs to be answered.
  • Figure 4: Impact of (a) prefix length and (b) structure tuning ratio on reasoning accuracy.
  • Figure 5: With the temperature set to 0.7, we sample 16 responses from Qwen2.5-Math-7B-Instruct for the given question; A1-A16 denote the corresponding outputs.
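
The coverage statistic mentioned in the Figure 2 caption can be sketched as follows. This is one plausible reading, assumed here for illustration only: for the trajectories sampled for a single question, group them by their first `prefix_len` tokens and report the average number of trajectories per distinct prefix. The paper's exact definition may differ, and the function name is hypothetical.

```python
from collections import Counter

def average_prefix_coverage(trajectories, prefix_len):
    """Average number of trajectories that share each distinct prefix.

    Assumption: each trajectory is a list of tokens, and two trajectories
    are "covered" by the same prefix when their first `prefix_len` tokens
    are identical.
    """
    counts = Counter(tuple(t[:prefix_len]) for t in trajectories)
    return len(trajectories) / len(counts)

# Toy usage: four sampled trajectories for one question.
samples = [[1, 2, 3], [1, 2, 4], [1, 5, 6], [7, 8, 9]]
coverage_by_length = {
    n: average_prefix_coverage(samples, n) for n in (1, 2, 3)
}
```

Under this reading, coverage can only stay the same or fall as the prefix grows, because longer prefixes split the trajectories into more groups.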