Refining Answer Distributions for Improved Large Language Model Reasoning
Soumyasundar Pal, Didier Chételat, Yingxue Zhang, Mark Coates
TL;DR
The paper tackles the plateau in reasoning performance from single-shot prompting and fixed multi-sample approaches by introducing Refined Answer Distributions (RAD), an iterative framework that maintains and updates a sequence of distributions over candidate answers $\{p_r(\tilde{y}|x)\}_{r}$ via marginalization over previous answers. RAD initializes from a base reasoning distribution and refines it through successive Refine(·) prompts: each round computes $p_{r+1}(\tilde{y}|x) = \sum_{y'} p(\tilde{y}|x, \mathrm{Refine}(y'))\, p_r(y'|x)$, that is, the refined-answer distribution $p(\tilde{y}|x, \mathrm{Refine}(y'))$ weighted by $p_r(y'|x)$, and uses Monte Carlo estimation when necessary. The method is prompting-strategy agnostic and relies on a probability-flow criterion, enabling more effective use of LLM calls and reduced sampling variance. Empirically, RAD improves reasoning accuracy across diverse benchmarks (arithmetic, MATH, BIG-Bench Hard) and model families (GPT-3.5/4, GPT-4o-mini, LLaMA variants) with comparable costs, outperforming Self-Consistency and other refinement methods in the majority of configurations. The work demonstrates that maintaining a distribution over answers and refining it iteratively can meaningfully enhance LLM reasoning, with broad applicability to various reasoning tasks and prompting regimes.
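The marginalization update described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `refine_step`, the answer strings, and every probability in the toy `kernel` table are hypothetical values chosen for illustration. In practice the conditional distributions $p(\tilde{y}|x, \mathrm{Refine}(y'))$ would be estimated from sampled LLM responses rather than written down by hand.

```python
from collections import defaultdict


def refine_step(p_r, kernel):
    """One refinement round: p_{r+1}(y) = sum over y' of kernel[y'][y] * p_r[y'].

    p_r    -- dict mapping each candidate answer y' to its current probability
    kernel -- dict mapping y' to a dict {y: p(y | x, Refine(y'))}
              (hypothetical toy values here; an LLM would supply these)
    """
    p_next = defaultdict(float)
    for y_prev, weight in p_r.items():
        for y_new, prob in kernel[y_prev].items():
            # Each previous answer contributes in proportion to its own mass.
            p_next[y_new] += weight * prob
    return dict(p_next)


# Toy initial distribution over three candidate answers (assumed values).
p0 = {"12": 0.5, "15": 0.3, "18": 0.2}

# Toy refinement behaviour: where the probability mass of each previous
# answer moves after a Refine prompt (assumed values, each row sums to 1).
kernel = {
    "12": {"12": 0.9, "15": 0.1},
    "15": {"12": 0.4, "15": 0.6},
    "18": {"12": 0.5, "18": 0.5},
}

p1 = refine_step(p0, kernel)
mode = max(p1, key=p1.get)  # the most likely answer after one round
print(p1, mode)
```

In this toy example the mass on "12" rises from 0.5 to 0.67 after one round, because the assumed kernel moves more probability toward that answer than away from it. The final prediction is the mode of the refined distribution.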
Abstract
Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient use of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, i.e., the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.
