Learning to Rank Chain-of-Thought: Using a Small Model
Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu
TL;DR
This work tackles reliable mathematical reasoning in LLMs by addressing the Best-of-N reranking problem with a lightweight energy-based verifier, EORM. Trained with only simple outcome labels, EORM assigns an energy to each CoT candidate and uses a pairwise Bradley-Terry objective to push correct solutions to lower energy than incorrect ones, enabling effective reranking without costly step-by-step annotations. At 55M parameters, EORM achieves state-of-the-art performance on GSM8k and MATH when integrated with open-source LLMs (e.g., GSM8k 90.7% and MATH 63.7% with Llama-3 8B). It also generalizes to unseen models and to out-of-distribution problems, including the AIME 2024 and AGIEval Gaokao Math benchmarks. Together, this generalization and efficiency make EORM a practical option for deploying more dependable LLMs in real-world reasoning tasks.
Abstract
Large Language Models (LLMs) struggle with reliable mathematical reasoning, and current verification methods are often computationally expensive. This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight post-hoc verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels, thus eliminating the need for expensive step-by-step annotations. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH. This performance is achieved by efficiently selecting the optimal reasoning path from a pool of candidates, allowing it to match or exceed the accuracy of far more resource-intensive Best-of-N sampling techniques. Crucially, our experiments show that EORM generalizes effectively to out-of-distribution problems and unseen models, indicating it learns fundamental principles of valid reasoning. This robustness, combined with its efficiency, establishes EORM as a practical tool for deploying more dependable LLMs in complex, real-world applications.
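
The ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`softplus`, `bradley_terry_loss`, `rerank`) are hypothetical, the energies are plain floats standing in for the output of a trained verifier, and the sign convention (a correct solution "beats" an incorrect one with probability sigmoid(E_incorrect - E_correct)) is an assumption consistent with "lower energy is better".

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(energies, labels):
    """Mean pairwise loss over all (correct, incorrect) candidate pairs.

    Assumed form: -log sigmoid(E_incorrect - E_correct), which equals
    softplus(E_correct - E_incorrect). Labels are outcome labels only:
    1 if the final answer is correct, 0 otherwise.
    """
    correct = [e for e, y in zip(energies, labels) if y == 1]
    incorrect = [e for e, y in zip(energies, labels) if y == 0]
    if not correct or not incorrect:
        return 0.0  # no contrastive pair available for this problem
    total = sum(softplus(ec - ei) for ec in correct for ei in incorrect)
    return total / (len(correct) * len(incorrect))


def rerank(candidates, energies):
    """Best-of-N selection: return the candidate with the lowest energy."""
    best = min(range(len(candidates)), key=lambda i: energies[i])
    return candidates[best]


# Hypothetical energies for three sampled CoT solutions to one problem.
candidates = ["solution A", "solution B", "solution C"]
energies = [0.3, -1.2, 0.8]
labels = [0, 1, 0]

print(rerank(candidates, energies))
print(bradley_terry_loss(energies, labels))
```

In this sketch the loss shrinks as correct candidates move to lower energy than incorrect ones, and inference needs only one scalar per candidate followed by an argmin, which is what makes a small verifier cheap to apply on top of N samples.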
