Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models

Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren

TL;DR

The paper addresses how large language models (LLMs) reason about probabilities beyond answering explicit questions. It introduces implicit probabilistic reasoning, which leverages next-token distributions in text generation to assess whether probabilistic information is integrated coherently into model output, rather than relying solely on MCQ-style correctness. Across multiple open-weight models and five scenario families, the study finds a consistent gap: models perform well on explicit probabilistic questions but often misalign their implicit predictions with ground-truth probabilities, and these predictions are distorted by prior events and label biases. The findings suggest that MCQ benchmarks can overstate probabilistic competence and highlight the need for evaluation paradigms that probe the generative use of probabilistic information to improve reliability in real-world deployments.

Abstract

The handling of probabilities in the form of uncertainty or partial information is an essential task for LLMs in many settings and applications. A common approach to evaluate an LLM's probabilistic reasoning capabilities is to assess its ability to answer questions pertaining to probability through the use of multiple-choice questions (MCQs). However, this paradigm, which we refer to as explicit probabilistic reasoning, has been shown in the literature to suffer from significant limitations (e.g., sensitivity to answer ordering). In this work, we introduce an alternative approach, named implicit probabilistic reasoning, which evaluates the models' ability to integrate probabilistic reasoning into their text generation process. To achieve this, we rephrase MCQs as text-completion scenarios with a determined set of outcomes and compare the model's next-token probability assignments to the true likelihood of the outcomes. In line with previous work, we find that models exhibit solid performance in their explicit probabilistic reasoning (i.e., answers to MCQs). However, during text completion (i.e., implicit probabilistic reasoning), where the same information must be taken into account to generate text, the models' predictions often significantly diverge from the known ground truth. For instance, our evaluation method reveals that implicit probabilistic reasoning is improperly influenced by many factors, such as independent prior events, partial observations about a result, or statistical background information. All of these issues can produce erroneous results in text generation that conventional MCQ-based evaluation does not detect.
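
The comparison described in the abstract, between next-token probabilities and the true likelihood of each outcome, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-dice sum scenario is borrowed from Figure 1, the log-probabilities are made up rather than taken from any model, and total variation distance is used as one possible divergence measure, not necessarily the metric used in the paper.

```python
import math
from itertools import product


def dice_sum_ground_truth(n_dice=2, sides=6):
    """Exact distribution of the sum of n fair dice (the known ground truth)."""
    counts = {}
    for roll in product(range(1, sides + 1), repeat=n_dice):
        total = sum(roll)
        counts[total] = counts.get(total, 0) + 1
    n_rolls = sides ** n_dice
    return {s: c / n_rolls for s, c in counts.items()}


def renormalize(logprobs):
    """Turn log-probabilities of the outcome tokens into a distribution
    restricted to the determined set of outcomes."""
    peak = max(logprobs.values())
    weights = {k: math.exp(v - peak) for k, v in logprobs.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


truth = dice_sum_ground_truth()

# Hypothetical model output: equal log-probability for every possible sum.
# A real evaluation would read these values from the model's next-token
# distribution at the completion position.
fake_logprobs = {s: math.log(1 / 11) for s in truth}
model = renormalize(fake_logprobs)

gap = total_variation(model, truth)
print(f"P(sum = 7): truth {truth[7]:.3f}, model {model[7]:.3f}")
print(f"total variation distance: {gap:.3f}")
```

A distance of zero would mean the implicit predictions match the ground truth exactly; the hypothetical uniform model above spreads probability evenly over all sums and therefore under-weights the most likely outcome, 7.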

Paper Structure

This paper contains 48 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of implicit and explicit reasoning with probabilities, for a scenario with two six-sided dice. Examples of detailed prompts can be found in Appendix \ref{app:scenarios}.
  • Figure 2: The model is provided with identical background statistics on a medical condition. Note that given the provided information, the likelihood that Sam suffers from anxiety is $18\%\times13\% + 82\%\times10\% \approx 11\%$.
  • Figure 3: Evaluation prompts and outputs for the explicit (left) and implicit (right) probabilistic reasoning evaluation paradigms. The prompt is displayed in light red, the model’s answer in dark red.
  • Figure 4: Coin flip scenario (unbiased coins) in three variants: regular (one set of coin flips), independent (one set of prior coin flips), and dependent (sum of the current and the previous coin flip). Accuracies are shown for the explicit (orange) and implicit (blue) probabilistic reasoning settings.
  • Figure 5: Accuracies for explicit (orange) and implicit (blue) probabilistic reasoning tasks in the Dice scenario with one or two observations on one or two dice.
  • ...and 4 more figures
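
The arithmetic in the Figure 2 caption is an instance of the law of total probability. The short Python sketch below reproduces it. The caption does not say what the 18% and 82% groups are, so the reading used here (two complementary subgroups with condition rates of 13% and 10%) is an assumption, and the function and variable names are hypothetical.

```python
def total_probability(group_weights, conditional_rates):
    """Law of total probability: sum of P(group) * P(condition | group)."""
    if abs(sum(group_weights) - 1.0) > 1e-9:
        raise ValueError("group weights must sum to 1")
    return sum(w * r for w, r in zip(group_weights, conditional_rates))


# Numbers from the Figure 2 caption; the interpretation of the two groups
# is assumed, since the caption only gives the percentages.
p_anxiety = total_probability([0.18, 0.82], [0.13, 0.10])
print(f"{p_anxiety:.4f}")  # 0.0234 + 0.0820 = 0.1054, which rounds to 11%
```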