Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning

Sirui Liang, Pengfei Cao, Jian Zhao, Cong Huang, Jun Zhao, Kang Liu

TL;DR

This work investigates why Representation Finetuning (ReFT) struggles with multi-step mathematical reasoning and identifies two root causes: misleading early reasoning prefixes and disturbance to numerical encoding. It introduces Bias-Restrained Prefix Representation Finetuning (BREP), which combines Prefix Training with Early-Stage Intervention and Bias Constraint Training, the latter using a PID controller to regulate intervention magnitude. Extensive experiments across Llama3 and Qwen models show that BREP outperforms both ReFT and weight-based PEFT methods on GSM8K, MATH500, and related benchmarks, while maintaining numerical faithfulness and generalizing well to out-of-domain tasks. The findings point to a parameter-efficient, robust approach to mathematical reasoning that prioritizes high-quality prefixes and faithful numeric representations, with practical implications for efficiently deploying large language models on complex arithmetic and reasoning tasks.
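To make the idea of PID-regulated intervention magnitude concrete, here is a minimal sketch of a discrete PID loop that steers a scalar "bias norm" toward a target value. Everything in it is an assumption for illustration: the class name `PIDController`, the gains, the target norm, and the toy update rule in `simulate` are hypothetical and are not taken from the paper, whose actual controller and training loop may differ.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative gains, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float) -> float:
        # Positive error means the measured norm is above the target.
        error = measured - target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target_norm: float = 1.0, start_norm: float = 4.0, steps: int = 200):
    """Toy dynamics: the control signal directly shrinks or grows the norm.

    This stands in for whatever effect a norm penalty has during real
    training; it is an assumption made only to keep the sketch runnable.
    """
    pid = PIDController(kp=0.5, ki=0.05, kd=0.1)
    norm = start_norm
    history = [norm]
    for _ in range(steps):
        control = pid.step(target_norm, norm)
        norm = max(0.0, norm - control)  # a norm cannot be negative
        history.append(norm)
    return history


if __name__ == "__main__":
    trace = simulate()
    print(f"start={trace[0]:.3f} end={trace[-1]:.3f}")
```

In this toy setting the norm drops quickly, undershoots the target slightly, and then settles near it; in a real training run the control signal would more plausibly scale a regularization term than modify the norm directly.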

Abstract

Parameter-efficient finetuning (PEFT) enhances model performance on downstream tasks by updating a minimal subset of parameters. Representation finetuning (ReFT) methods further improve efficiency by freezing model weights and optimizing internal representations with fewer parameters than PEFT, outperforming PEFT on several tasks. However, ReFT exhibits a significant performance decline on mathematical reasoning tasks. To address this problem, the paper demonstrates that ReFT's poor performance on mathematical tasks primarily stems from its struggle to generate effective reasoning prefixes during the early inference phase. Moreover, ReFT disturbs the numerical encoding, and the resulting errors accumulate during the chain-of-thought (CoT) stage. Based on these observations, this paper proposes Bias-REstrained Prefix Representation FineTuning (BREP ReFT), which enhances ReFT's mathematical reasoning capability by truncating training data to optimize the generation of initial reasoning prefixes, intervening on the early inference stage to prevent error accumulation, and constraining the intervention vectors' magnitude to avoid disturbing numerical encoding. Extensive experiments across diverse model architectures demonstrate BREP's superior effectiveness, efficiency, and robust generalization capability, outperforming both standard ReFT and weight-based PEFT methods on the task of mathematical reasoning. The source code is available at https://github.com/LiangThree/BREP.
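The three ingredients named in the abstract can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the released implementation: the function names `truncate_to_prefix`, `clip_norm`, and `intervene_early` are hypothetical, hidden states are modeled as plain lists of floats, and simple norm clipping is swapped in as a stand-in for the paper's magnitude constraint.

```python
import math


def truncate_to_prefix(tokens, prefix_len):
    """Keep only the first prefix_len target tokens of a training example."""
    return tokens[:prefix_len]


def clip_norm(bias, max_norm):
    """Rescale a bias vector so its L2 norm does not exceed max_norm.

    Clipping is an assumption here; it is a simpler substitute for
    whatever constraint the actual method applies during training.
    """
    norm = math.sqrt(sum(b * b for b in bias))
    if norm == 0.0 or norm <= max_norm:
        return list(bias)
    scale = max_norm / norm
    return [b * scale for b in bias]


def intervene_early(hidden_states, bias, num_early_steps):
    """Add the bias to hidden states of the first num_early_steps positions only."""
    out = []
    for t, h in enumerate(hidden_states):
        if t < num_early_steps:
            out.append([x + b for x, b in zip(h, bias)])
        else:
            out.append(list(h))  # later positions are left untouched
    return out


if __name__ == "__main__":
    prefix = truncate_to_prefix(["Let", "x", "be", "the", "cost", "."], 3)
    bias = clip_norm([3.0, 4.0], max_norm=1.0)
    states = intervene_early([[0.0, 0.0]] * 4, bias, num_early_steps=2)
    print(prefix, bias, states)
```

The point of the sketch is the division of labor: training only sees short prefixes, the learned bias touches only early positions, and its norm is kept small so that later numeric tokens are not perturbed.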

Paper Structure

This paper contains 39 sections, 19 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Overview of ReFT and BREP. Examples of a misleading reasoning prefix and of disturbed numerical encoding are shown for ReFT.
  • Figure 2: The impact of ReFT and Base prefixes of various lengths on mathematical reasoning performance.
  • Figure 3: Effect of interventions on numerical encoding. X-axis: intensity of the intervention along the direction of numerical encoding. Line plot (left y-axis): error probability of four-digit addition under interventions. Bar plot (right y-axis): intervention intensity of ReFT projected onto the numerical encoding direction.
  • Figure 4: The analysis of BREP. (a) Comparison of mathematical reasoning effectiveness guided by BREP prefixes and ReFT prefixes. (b) Numerical faithfulness performance gap between BREP and Base model. (c) Numerical faithfulness performance gap between ReFT and Base model. (Red indicates faithfulness improvement, blue indicates faithfulness degradation.)
  • Figure 5: The relationship between model performance and bias magnitude.
  • ...and 10 more figures