Refining Answer Distributions for Improved Large Language Model Reasoning

Soumyasundar Pal, Didier Chételat, Yingxue Zhang, Mark Coates

TL;DR

The paper tackles the plateau in reasoning performance from single-shot prompting and fixed multi-sample approaches by introducing Refined Answer Distributions (RAD), an iterative framework that maintains and updates a sequence of distributions over candidate answers, $\{p_r(\tilde{y}|x)\}_{r}$, via marginalization over previous answers. RAD initializes from a base reasoning distribution and refines it through successive Refine(·) prompts: each update computes $p_{r+1}(\tilde{y}|x) = \sum_{y'} p(\tilde{y}|x, \mathrm{Refine}(y'))\, p_r(y'|x)$, that is, the refined answer distribution $p(\tilde{y}|x, \mathrm{Refine}(y'))$ weighted by $p_r(y'|x)$ and summed over previous answers $y'$, using Monte Carlo estimation when necessary. The method is prompting-strategy agnostic and relies on a probability-flow criterion, enabling more effective use of LLM calls and reduced sampling variance. Empirically, RAD improves reasoning accuracy across diverse benchmarks (arithmetic, MATH, BIG-Bench Hard) and model families (GPT-3.5/4, GPT-4o-mini, LLaMA variants) at comparable cost, outperforming Self-Consistency and other refinement methods in the majority of configurations. The work demonstrates that maintaining a distribution over answers and refining it iteratively can meaningfully enhance LLM reasoning, with broad applicability to various reasoning tasks and prompting regimes.
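
The refinement step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `rad_update`, `sample_refined`, and `make_toy_llm` are hypothetical, the LLM is replaced by a toy random sampler, and drawing the same number of samples for every distinct previous answer is an assumption made for simplicity. The probability-flow criterion mentioned in the summary is not modeled here.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def empirical_distribution(samples: List[str]) -> Distribution:
    """Monte Carlo estimate of a distribution from a list of sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


def rad_update(
    p_r: Distribution,
    sample_refined: Callable[[str], str],
    n_samples: int,
) -> Distribution:
    """One refinement step: p_{r+1}(y) = sum_{y'} p(y | x, Refine(y')) * p_r(y').

    `sample_refined(prev_answer)` stands in for one LLM call with a
    Refine(prev_answer) prompt (hypothetical interface). Using the same
    `n_samples` for every distinct previous answer is an assumption.
    """
    p_next: Dict[str, float] = defaultdict(float)
    for prev_answer, weight in p_r.items():
        refined = [sample_refined(prev_answer) for _ in range(n_samples)]
        # Weight the conditional estimate by the probability of the old answer.
        for answer, prob in empirical_distribution(refined).items():
            p_next[answer] += weight * prob
    return dict(p_next)


def mode(p: Distribution) -> str:
    """Most likely answer under the current distribution."""
    return max(p, key=p.get)


def make_toy_llm(rng: random.Random) -> Callable[[str], str]:
    """Toy stand-in for an LLM: returns "6" with probability 0.7,
    otherwise repeats the previous answer it was asked to refine."""
    def sample_refined(prev_answer: str) -> str:
        return "6" if rng.random() < 0.7 else prev_answer
    return sample_refined


if __name__ == "__main__":
    rng = random.Random(0)
    llm = make_toy_llm(rng)
    # Initial distribution, e.g. estimated from several base-prompt samples.
    p = {"4": 0.5, "5": 0.3, "6": 0.2}
    for r in range(3):
        p = rad_update(p, llm, n_samples=20)
        print(f"round {r + 1}: mode={mode(p)} dist={p}")
```

Because every conditional estimate sums to one and the weights $p_r(y'|x)$ sum to one, the updated values again form a probability distribution, so the mode can be read off after any round.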

Abstract

Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient usage of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode -- the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of one iteration of our proposed method, Refined Answer Distribution (RAD). At its initialization, a distribution of answers is obtained from the LLM via multiple queries. In each subsequent iteration, new answers are sampled by refining each distinct old answer. The resulting samples are then accordingly weighted by the probability of the previous answers for marginalization.
  • Figure 2: The estimated probabilities of different answers from CoT+SC, PHP+SC, and CoT+RAD (using GPT-3.5 Turbo) for an example from the GSM8K dataset. Question: The ice cream parlor was offering a deal, buy 2 scoops of ice cream, get 1 scoop free. Each scoop cost $1.50. If Erin had $6.00, how many scoops of ice cream should she buy? Answer: 6.
  • Figure 3: Histogram of ranks of the algorithms (the highest probability of the correct answer results in the lowest rank) for the 'difficult' questions from all six arithmetic datasets using GPT-4o-mini.
  • Figure 4: Histogram of ranks of the algorithms (the highest probability of the correct answer results in the lowest rank) for the 'difficult' questions from all six arithmetic datasets using GPT-3.5 Turbo.
  • Figure 5: Histogram of ranks of the algorithms (the highest probability of the correct answer results in the lowest rank) for the 'difficult' questions from all six arithmetic datasets using GPT-4 Turbo.