Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models
Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren
TL;DR
The paper addresses how large language models (LLMs) reason about probabilities beyond answering explicit questions. It introduces implicit probabilistic reasoning, which uses next-token distributions in text generation to assess whether probabilistic information is integrated coherently into model output, rather than relying solely on MCQ-style correctness. Across multiple open-weight models and five scenario families, the study finds a consistent gap: models perform well on explicit probabilistic questions, but their implicit predictions often diverge from ground-truth probabilities and are distorted by prior events and label biases. The findings suggest that MCQ benchmarks can overstate probabilistic competence and highlight the need for evaluation paradigms that probe the generative use of probabilistic information to improve reliability in real-world deployments.
Abstract
The handling of probabilities in the form of uncertainty or partial information is an essential task for LLMs in many settings and applications. A common approach to evaluating an LLM's probabilistic reasoning capabilities is to assess its ability to answer questions pertaining to probability through the use of multiple-choice questions (MCQs). However, this paradigm, which we refer to as explicit probabilistic reasoning, has been shown in the literature to have significant limitations (e.g., sensitivity to answer ordering). In this work, we introduce an alternative approach, named implicit probabilistic reasoning, which evaluates the models' ability to integrate probabilistic reasoning into their text generation process. To achieve this, we rephrase MCQs as text-completion scenarios with a predetermined set of outcomes and compare the model's next-token probability assignments to the true likelihood of the outcomes. In line with previous work, we find that models exhibit solid performance in their explicit probabilistic reasoning (i.e., answers to MCQs). However, during text completion (i.e., implicit probabilistic reasoning), where the same information must be taken into account to generate text, the models' predictions often diverge significantly from the known ground truth. For instance, our evaluation method reveals that implicit probabilistic reasoning is improperly influenced by many factors, such as independent prior events, partial observations about a result, or statistical background information. All of these issues can produce erroneous results in text generation that conventional MCQ-based evaluation does not detect.
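
As an illustration of the comparison described in the abstract, the following minimal sketch restricts a model's next-token log-probabilities to the outcome tokens of a text-completion scenario, renormalizes them, and measures their distance from the ground-truth distribution. The function names (`renormalize`, `total_variation`), the coin-flip scenario, the log-probability values, and the choice of total variation distance are all assumptions made for illustration; the abstract does not specify the metric or the prompts used in the paper.

```python
import math


def renormalize(outcome_logprobs):
    """Turn log-probabilities of the outcome tokens into a distribution.

    Probability mass assigned to tokens outside the outcome set is discarded,
    so the returned values sum to 1 over the outcomes only.
    """
    peak = max(outcome_logprobs.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - peak) for k, v in outcome_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def total_variation(p, q):
    """Total variation distance between two distributions over the same keys."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


# Hypothetical scenario: a fair coin has landed heads three times in a row and
# the prompt ends with "The next flip lands on". The flips are independent,
# so the ground truth stays at 50/50.
ground_truth = {"heads": 0.5, "tails": 0.5}

# Made-up next-token log-probabilities for the two outcome tokens; in practice
# these would be read from the model's output distribution.
model_logprobs = {"heads": math.log(0.18), "tails": math.log(0.42)}

implicit = renormalize(model_logprobs)
gap = total_variation(ground_truth, implicit)

print({k: round(v, 3) for k, v in implicit.items()})
print(round(gap, 3))
```

A gap of zero would mean the implicit prediction matches the ground truth exactly. In this made-up case the model shifts mass toward "tails" after a run of heads, the kind of influence from independent prior events that the abstract describes.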
