Learning to Rank Chain-of-Thought: Using a Small Model

Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu

TL;DR

This work tackles reliable mathematical reasoning in LLMs by addressing the Best-of-N re-ranking problem with a lightweight energy-based verifier, EORM. Trained with only simple outcome labels, EORM assigns an energy to each CoT candidate and uses a pairwise Bradley-Terry objective to push correct solutions to lower energy than incorrect ones, enabling effective reranking without costly step-by-step annotations. At 55M parameters, EORM achieves state-of-the-art performance on GSM8k and MATH when integrated with open-source LLMs (e.g., GSM8k 90.7% and MATH 63.7% with Llama-3 8B) and generalizes to out-of-distribution problems and unseen models, including AIME 2024 and AGIEval Gaokao Math benchmarks. The results demonstrate strong generalization, efficiency, and practical potential for deploying more dependable LLMs in real-world reasoning tasks across diverse problem domains.
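
The pairwise objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the common Bradley-Terry form, $-\log\sigma(E^- - E^+)$ averaged over every (correct, incorrect) pair for one question, and the names `softplus` and `bradley_terry_loss` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(pos_energies, neg_energies):
    """Mean of -log sigmoid(E_neg - E_pos) over all (correct, incorrect) pairs.

    Assumed loss form (not confirmed by the source): the loss shrinks as
    correct candidates receive lower energy than incorrect ones.
    """
    if not pos_energies or not neg_energies:
        raise ValueError("need at least one correct and one incorrect candidate")
    total = 0.0
    for e_pos in pos_energies:
        for e_neg in neg_energies:
            margin = e_neg - e_pos  # positive when the correct one has lower energy
            total += softplus(-margin)  # equals -log(sigmoid(margin))
    return total / (len(pos_energies) * len(neg_energies))


# Toy energies for one question: two correct and two incorrect candidates.
well_ranked = bradley_terry_loss([-1.5, -0.5], [1.0, 2.0])
badly_ranked = bradley_terry_loss([1.0, 2.0], [-1.5, -0.5])
print(f"well ranked: {well_ranked:.4f}, badly ranked: {badly_ranked:.4f}")
```

Only outcome labels (correct or incorrect final answer) are needed to form the pairs, which is why no step-level annotation enters the loss.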

Abstract

Large Language Models (LLMs) struggle with reliable mathematical reasoning, and current verification methods are often computationally expensive. This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight post-hoc verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels, thus eliminating the need for expensive annotations. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH. This performance is achieved by efficiently selecting the optimal reasoning path from a pool of candidates, allowing it to match or exceed the accuracy of far more resource-intensive Best-of-N sampling techniques. Crucially, our experiments show that EORM generalizes effectively to out-of-distribution problems and unseen models, indicating it learns fundamental principles of valid reasoning. This robustness, combined with its efficiency, establishes EORM as a practical tool for deploying more dependable LLMs in complex, real-world applications.
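
Selecting a reasoning path from a pool of candidates reduces to an argmin over energies. The sketch below assumes a scoring callable `energy_fn(question, candidate)` that returns a float; that signature, the function name `rerank_best_of_n`, and the toy lookup table standing in for a trained verifier are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_best_of_n(
    question: str,
    candidates: Sequence[str],
    energy_fn: Callable[[str, str], float],
) -> Tuple[str, List[float]]:
    """Return the lowest-energy candidate together with all energies."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    energies = [energy_fn(question, c) for c in candidates]
    best_index = min(range(len(candidates)), key=energies.__getitem__)
    return candidates[best_index], energies


# Stand-in for a trained verifier: a fixed lookup table (illustration only).
toy_table = {
    "3 + 4 = 8, so the answer is 8": 1.9,
    "3 + 4 = 7, so the answer is 7": -2.1,
    "3 * 4 = 12, so the answer is 12": 0.6,
}


def toy_energy(question: str, candidate: str) -> float:
    return toy_table[candidate]


best, energies = rerank_best_of_n("What is 3 + 4?", list(toy_table), toy_energy)
print(best)
```

The generator LLM is called N times to build the pool, while the verifier only scores each candidate once, so the reranking cost is dominated by the small scoring model.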

Paper Structure

This paper contains 37 sections, 4 theorems, 36 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Theorem C.1

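For any finite candidate set $\mathcal{Y}_{\mathrm{cand}}\subset\mathcal{Y}$, the configuration $y^*$ that minimizes the energy function $E_\theta(y)$ over this set is also the configuration that maximizes the Boltzmann probability $p_\theta(y)$:

$$y^* = \arg\min_{y\in\mathcal{Y}_{\mathrm{cand}}} E_\theta(y) = \arg\max_{y\in\mathcal{Y}_{\mathrm{cand}}} p_\theta(y).$$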
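
The claim of Theorem C.1 can be checked numerically on a small candidate set. This sketch assumes the standard Boltzmann form $p_\theta(y) = \exp(-E_\theta(y)) / Z_\theta$ at unit temperature, with $Z_\theta$ summed over the candidate set; the energies and the helper name `boltzmann_probs` are made up for illustration.

```python
import math


def boltzmann_probs(energies):
    """p_i = exp(-E_i) / sum_j exp(-E_j), computed with a stability shift."""
    e_min = min(energies)
    # Subtracting e_min rescales every weight by the same factor, which
    # cancels in the normalization and avoids overflow for large |E|.
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


energies = [2.3, -0.7, 1.1, 0.4]  # toy energies for four candidates
probs = boltzmann_probs(energies)

argmin_energy = min(range(len(energies)), key=energies.__getitem__)
argmax_prob = max(range(len(probs)), key=probs.__getitem__)
print(argmin_energy, argmax_prob)
```

Because $\exp(-E)$ is strictly decreasing in $E$ and the partition function is the same positive constant for every candidate, ranking by energy never requires computing $Z_\theta$.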

Figures (4)

  • Figure 1: An overview flow chart of EORM. The model tokenizes the question-answer pair, then computes an energy score using an Energy-Based Model (EBM). The Bradley-Terry loss serves as the objective for reward-based fine-tuning. During deployment, the trained energy reward model computes energy scores for classification tasks.
  • Figure 2: A comparison of the parameter counts of a standard reward model and EORM. A typical reward model has approximately 7 billion parameters, while EORM has only 55 million, a size reduction of over 127 times.
  • Figure 3: EORM performance with varying samples per question. We conduct experiments to show how the number of samples influences the problem-solving rate, using accuracy as the metric. The results indicate that model performance improves as the number of samples increases.
  • Figure 4: Ablation studies on key components of EORM. (Left) Performance comparison between the Transformer-based EORM and a simpler MLP verifier on the GSM8k and MATH benchmarks. (Right) Impact of using a universal (GPT-2) versus a native (Llama 3) tokenizer. The results highlight the critical role of the Transformer architecture and demonstrate the model's robustness to the choice of tokenizer.

Theorems & Definitions (16)

  • Definition C.1: Energy Function
  • Definition C.2: Boltzmann Distribution
  • Definition C.3: Partition Function
  • Theorem C.1: Energy–Probability Equivalence for Ranking
  • Proof of Theorem C.1 (Energy–Probability Equivalence for Ranking)
  • Definition C.4: Classifier Logits and Softmax Probability
  • Proposition C.1: Implicit Energy Functions from Classifier Logits
  • Proof of Proposition C.1 (Implicit Energy Functions from Classifier Logits)
  • Definition C.5: Optimal Energy Separation (Idealized Goal)
  • Definition C.6: Sigmoid Function
  • ...and 6 more