
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim

Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer-trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
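The abstract describes the objective only at a high level: the model emits a set of candidate answers (with confidences) in one pass, and the reward scores the set rather than a single answer. The sketch below is a minimal, non-authoritative illustration of what such a set-level reward could look like; the function name, the interface, and the specific coverage-minus-Brier decomposition are our assumptions for illustration, not the paper's exact objective.

```python
# Hypothetical sketch of a set-level reward for multi-answer RL.
# The paper's actual objective may differ; the reward decomposition
# below (coverage minus a Brier-style calibration penalty) is an
# illustrative assumption.

from typing import List, Set


def set_level_reward(
    answers: List[str],        # k candidate answers parsed from one generation
    confidences: List[float],  # model-stated confidence for each candidate
    gold: Set[str],            # set of acceptable reference answers
) -> float:
    """Score one multi-answer generation as a set.

    Rewards (i) coverage: how many distinct valid answers were
    recovered, (ii) diversity: duplicated candidates earn nothing,
    and (iii) calibration: confidences should track correctness.
    """
    unique = list(dict.fromkeys(answers))  # deduplicate, keep order

    # Coverage: fraction of distinct valid answers recovered.
    hits = [a for a in unique if a in gold]
    coverage = len(hits) / max(len(gold), 1)

    # Brier-style penalty: squared gap between stated confidence
    # and per-candidate correctness.
    brier = sum(
        (c - float(a in gold)) ** 2 for a, c in zip(answers, confidences)
    ) / max(len(answers), 1)

    return coverage - brier
```

In a GRPO- or PPO-style training loop, a scalar like this would replace the usual single-answer correctness reward for each sampled rollout, which is what pushes the policy to spread probability mass across plausible answers rather than repeating its mode.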


Paper Structure

This paper contains 22 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: While standard RL trains LMs to consistently output the most likely answer to a question, Multi-Answer Reinforcement Learning trains models to output distributions of diverse answers.
  • Figure 2: On DDXPlus (left) and MBPP (right), we generate 30 answers from each model: 30 individual samples from RLVR-Single, and 10 sets of 3 from RLVR-Multi. Despite the equal total generation budget, RLVR-Multi produces significantly more unique correct answers than RLVR-Single, indicating that RLVR-Single's mode-seeking behavior limits diversity (a minimal sketch of this equal-budget counting protocol appears after this list).
  • Figure 3: Calibration curves on DDXPlus. RLCR-Multi is significantly better calibrated than RLVR-Multi, though it diverges at higher confidences. RLVR-Multi remains systematically overconfident. The size of each dot corresponds to how many examples are found in that bucket.
  • Figure 4: Distribution of the number of unique diagnoses per question across 5,000 test examples of DDXPlus. RLVR-Multi produces more distinct diagnoses than RLVR-Single, explaining coverage gains under multi-answer training.
  • Figure 5: Significant subsequence overlap between independently sampled RLVR-Single responses, even those that yield different answers, indicating that independent sampling largely re-instantiates the same reasoning tokens. Multi-Answer RL mitigates this effect by optimizing multiple generations jointly, which reduces repeated token sequences and yields lower within-question overlap.
  • ...and 3 more figures
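Figure 2's comparison reduces to counting distinct correct answers under a fixed generation budget. As a rough sketch of that protocol (not the paper's evaluation code), the snippet below treats each model as a callable returning one generation's answers; `sample_answers` and the exact-match check against the gold set are simplifying assumptions.

```python
# Illustrative only: a minimal version of the equal-budget comparison
# in Figure 2. The sampling interface is a hypothetical stand-in, and
# answer matching is simplified to string equality against a gold set.

from typing import Callable, List, Set


def unique_correct(
    sample_answers: Callable[[], List[str]],  # one call = one generation
    n_calls: int,
    gold: Set[str],
) -> int:
    """Count distinct correct answers across a fixed budget.

    A single-answer model returns a 1-element list per call
    (30 calls = 30 samples); a multi-answer model returns a set of
    candidates per call (10 calls of 3 = 30 answers total).
    """
    seen: Set[str] = set()
    for _ in range(n_calls):
        seen.update(a for a in sample_answers() if a in gold)
    return len(seen)
```

For example, `unique_correct(lambda: ["flu"], 30, {"flu", "covid"})` can never exceed 1 however many samples are drawn, which is exactly the mode-collapse failure the figure illustrates for single-answer training.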