Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning

Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao

TL;DR

This work analyzes how reward models (RMs) influence LLM reasoning. It shows that an RM can hurt performance on simple questions, struggles to discriminate low-frequency negative responses, and performs poorly under high search diversity. The authors formalize RM-based inference as $\mathbb{R} = S(M(q),N;\Phi)$ and $\hat{r} = \underset{r \in \mathbb{R}}{\arg\max} f(r)$, and they run extensive experiments with Best-of-N (BoN) and Monte Carlo Tree Search (MCTS) across multiple datasets (e.g., MATH, GSM8K, OlympiadBench) to diagnose root causes. They introduce Optimal Clustering Tree Search (OCTS), a three-step method (exploration, selection, expansion) that mitigates the inverse long-tail in RM scoring and reduces intermediate-state diversity, achieving up to 3.2% accuracy gains over baselines. The results highlight the need to consider question difficulty and search diversity when deploying RM-based reasoning in LLMs, and point to practical improvements for faithfulness and effectiveness in CoT reasoning.
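The two formulas in the TL;DR can be read as a two-stage pipeline: a search step that produces a candidate set $\mathbb{R}$, followed by an argmax over reward scores. The sketch below illustrates that reading for the BoN case only. It assumes that $M$ is the policy model, $q$ the question, $N$ the sampling number, $S$ the search procedure (with parameters $\Phi$, omitted here), and $f$ the reward model's scoring function; the summary does not define these symbols, so this interpretation is an assumption. The names `search`, `select`, `toy_policy`, and `toy_reward` are hypothetical stand-ins, not the authors' code, and the toy policy and reward are placeholders for real models.

```python
import random
from typing import Callable, List

Response = str
Policy = Callable[[str, random.Random], Response]
Reward = Callable[[Response], float]


def search(policy: Policy, question: str, n: int, rng: random.Random) -> List[Response]:
    """Assumed reading of R = S(M(q), N; Phi): draw N independent samples (BoN-style)."""
    return [policy(question, rng) for _ in range(n)]


def select(candidates: List[Response], reward: Reward) -> Response:
    """r_hat = argmax over r in R of f(r): keep the highest-scoring candidate."""
    return max(candidates, key=reward)


def toy_policy(question: str, rng: random.Random) -> Response:
    # Placeholder for sampling from an LLM: picks one of three fixed answers.
    return rng.choice(["4", "5", "22"])


def toy_reward(response: Response) -> float:
    # Placeholder for a reward model: a fixed, made-up score per answer.
    return {"4": 0.9, "5": 0.2, "22": 0.1}[response]


rng = random.Random(0)
candidates = search(toy_policy, "What is 2 + 2?", 10, rng)
best = select(candidates, toy_reward)
print(len(candidates), best)
```

In this framing, the final answer depends entirely on how well the scores rank the sampled candidates, which is why the reward-scoring and search-diversity issues described in the TL;DR matter: a candidate that is wrong but scored highly will be returned by the argmax.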

Abstract

Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.

Paper Structure

This paper contains 57 sections, 2 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: The performance of different policy models using various reward models for BoN inference on the MATH dataset ($N$ = 10).
  • Figure 2: Performance of BoN inference across different question difficulty levels.
  • Figure 3: Performance of MCTS inference across different question difficulty levels.
  • Figure 4: Performance of the two inference methods across different sampling numbers.
  • Figure 5: The variation in question answering correctness as the sampling number changes. Blue indicates a correct answer, while red indicates an incorrect answer.
  • ...and 17 more figures