AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj, Huaxiu Yao

TL;DR

**AutoPRM** is a novel self-supervised framework that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. It also introduces context-guided decoding to avoid reward tampering and to guide the subquestion solver towards the solution of the holistic problem.

Abstract

Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, we propose AutoPRM, a novel self-supervised framework that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, AutoPRM first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially applies reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that AutoPRM significantly improves performance on mathematical and commonsense reasoning tasks over state-of-the-art (SOTA) methods. More encouragingly, AutoPRM can be easily integrated with other orthogonal reasoning pipelines.
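
The decompose-then-solve flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve_by_decomposition`, `toy_decompose`, and `toy_answer` are hypothetical names, the toy functions stand in for the fine-tuned models, and the way the granularity value changes the number of subquestions is an arbitrary choice made only for the demo.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (subquestion, answer) pairs solved so far


def solve_by_decomposition(
    question: str,
    decompose: Callable[[str, float], List[str]],
    answer: Callable[[str, History, str], str],
    granularity: float,
) -> History:
    """Split `question` into subquestions, then answer them one by one."""
    history: History = []
    for sub in decompose(question, granularity):
        # The solver always sees the primary question plus all earlier steps,
        # so each partial answer stays tied to the overall problem.
        history.append((sub, answer(question, history, sub)))
    return history


def toy_decompose(question: str, granularity: float) -> List[str]:
    # Hypothetical stand-in for the decomposition model (assumed mapping).
    steps = ["How many apples at the start?", "How many apples at the end?"]
    return steps if granularity >= 0.5 else steps[-1:]


def toy_answer(question: str, history: History, sub: str) -> str:
    # Hypothetical stand-in for the subquestion solver.
    return f"answer {len(history) + 1}"


fine = solve_by_decomposition("Toy problem", toy_decompose, toy_answer, 0.8)
coarse = solve_by_decomposition("Toy problem", toy_decompose, toy_answer, 0.2)
```

Passing the primary question into every solver call is only a rough analogue of guiding the solver towards the holistic problem; the details of context-guided decoding itself are not reproduced here.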

Paper Structure

This paper contains 26 sections, 4 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: The decoding pipeline of our proposed AutoPRM, which consists of a unified question decomposition (QD) and question answering (QA) model. First, QD breaks down the problem into a series of sub-questions according to a user-specified granularity. Then, the RL-optimized QA model solves them sequentially via FiM-decoding, which consistently guides QA toward the solution of the primary problem.
  • Figure 2: A diagram illustrating the three steps of AutoPRM: (1) supervised fine-tuning (SFT) on the merged question decomposition dataset $\mathcal{D}_{QG}$ and the FIM-transformed question answering dataset $\mathcal{D}_{QA}$; (2) training a stepwise result verifier on the LLM-generated solutions of $\mathcal{D}_{QA}$; (3) RL fine-tuning against the stepwise verifier. The base model first decomposes the question into several intermediate subquestions and solves them sequentially via LGD. Then the candidates with high reward are selected to fine-tune the policy via expert iteration.
  • Figure 3: Assessment on the GSM8K dataset with respect to decomposition granularity $\epsilon$. We evaluate the final-answer accuracy, perplexity, and BERT similarity (to ground-truth solutions). Accuracy shows that an intermediate granularity level ($\epsilon = 0.8$) yields the best performance. Perplexity indicates that fine-grained guidance enhances the model's certainty in problem-solving. The increased similarity to the ground-truth solutions implies that AutoPRM effectively decomposes questions in a way that aligns with the human labeller.
  • Figure 4: Comparing AutoPRM and CoT+SC decoding on problems of varying complexity and with different numbers of subquestions. AutoPRM outperforms CoT+SC by a large margin, especially for problems with longer reasoning chains.
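
The selection step mentioned in the Figure 2 caption, where high-reward candidates are kept for fine-tuning, might look roughly like the sketch below. It assumes a fixed reward threshold as the selection rule, which the captions do not specify; `select_high_reward`, `toy_verifier`, and the sample data are hypothetical and stand in for the trained stepwise verifier and sampled solutions.

```python
from typing import Callable, Dict, List, Tuple


def select_high_reward(
    candidates: Dict[str, List[str]],
    verifier: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep (subquestion, answer) pairs whose verifier reward meets the threshold."""
    selected: List[Tuple[str, str]] = []
    for subquestion, answers in candidates.items():
        for ans in answers:
            # Assumed rule: keep a candidate only if its reward is high enough.
            if verifier(subquestion, ans) >= threshold:
                selected.append((subquestion, ans))
    return selected


def toy_verifier(subquestion: str, ans: str) -> float:
    # Hypothetical stand-in for the stepwise verifier: rewards one known answer.
    return 1.0 if (subquestion, ans) == ("What is 2 + 2?", "4") else 0.0


# Several sampled answers per subquestion (made-up data for illustration).
sampled = {"What is 2 + 2?": ["4", "5", "22"]}
fine_tuning_set = select_high_reward(sampled, toy_verifier, threshold=0.5)
```

In an expert-iteration loop, the pairs in `fine_tuning_set` would serve as the next round of fine-tuning data for the policy, after which new candidates are sampled and filtered again.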