Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste

TL;DR

The paper investigates whether large language models (LLMs) form and update probabilistic beliefs in line with Bayesian inference. It introduces a flight-recommendation task and a normative Bayesian Assistant as a gold-standard updater, revealing that off-the-shelf LLMs struggle to update beliefs over rounds. It then shows that supervised fine-tuning via Bayesian teaching (training LLMs to mimic the Bayesian Assistant) substantially improves belief updating and allows generalization to new tasks and domains (hotel recommendations and web shopping), including interactions with real humans. The findings demonstrate that LLMs can acquire transferable probabilistic reasoning skills from demonstrations, enabling robust decision-making under uncertainty across diverse applications and settings.

Abstract

Artificial intelligence systems based on large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs need to construct internal representations of the world and form probabilistic beliefs about those representations. To provide a user with personalized recommendations, for example, the LLM needs to gradually infer the user's preferences, over the course of multiple interactions. To evaluate whether contemporary LLMs are able to do so, we use the Bayesian inference framework from probability theory, which lays out the optimal way to update an agent's beliefs as it receives new information. We first show that LLMs do not update their beliefs as expected from the Bayesian framework, and that consequently their predictions do not improve as expected as more information becomes available. To address this issue, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the normative Bayesian model. We find that this approach not only significantly improves the LLM's performance on the particular recommendation task it is trained on, but also enables generalization to other tasks. This suggests that this method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
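
For reference, the belief update that the abstract appeals to is Bayes' rule applied once per round. The notation below is an assumption introduced here for illustration, not the paper's: $\theta$ stands for a candidate user preference, $c_t$ for the user's observed choice in round $t$, and choices are taken to be conditionally independent given $\theta$.

$$
p(\theta \mid c_{1:t}) = \frac{p(c_t \mid \theta)\, p(\theta \mid c_{1:t-1})}{\sum_{\theta'} p(c_t \mid \theta')\, p(\theta' \mid c_{1:t-1})}
$$

The posterior after round $t-1$ serves as the prior for round $t$, so an agent that follows this rule should make better predictions as more choices are observed.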

Paper Structure

This paper contains 68 sections, 5 equations, 27 figures, 15 tables.

Figures (27)

  • Figure 1: The flight recommendation task (left) involves multi-round interactions between a user and a flight booking assistant. In each round, the assistant is asked to recommend to the user one of three available flight options. The assistant is then shown the flight that was in fact chosen by the user (based on the user's reward function, which characterizes the user's preferences). To make good recommendations, the assistant needs to infer the user's preferences from the user's choices. To teach the LLM to reason probabilistically, we fine-tune the LLM on interactions between users and a Bayesian Assistant, which represents the normative way to update beliefs about the user's preferences, and evaluate the generalization of the fine-tuned models on other unseen tasks (right).
  • Figure 2: LLMs show limited or no improvement over multiple interactions with the user. We show accuracy after the first round and final (fifth) round. We compare off-the-shelf LLMs from different model families, human participants, and the Bayesian Assistant. For human participants, we only evaluate on a subset of 48 out of our 624 simulated users. The LLMs perform considerably worse than the Bayesian Assistant. Human participants demonstrate a larger improvement than most LLMs as they receive more information but still fall short of the accuracy expected from the normative Bayesian strategy. For the human study, the error bars show the averaged standard error across participants; for models, they show the standard error across the three sets of interactions with each of the 624 users.
  • Figure 3: Supervised fine-tuning teaches LLMs to approximate probabilistic inference. We show accuracy after the first round and final (fifth) round across different assistants. We compare the original LLMs, LLMs fine-tuned on user interactions with the Bayesian Assistant, and LLMs fine-tuned on user interactions with an Oracle, which always provides the correct answer. Both types of fine-tuning significantly improve LLMs' performance, and Bayesian teaching is consistently more effective than oracle teaching. Error bars show the standard error across three random seeds (and three training runs). All results are statistically significant, $p < 0.001$ (see Appendix Section \ref{sec:app_stats}).
  • Figure 4: Agreement between the LLMs and the Bayesian Assistant, measured by the proportion of trials where the LLMs make the same predictions as the Bayesian Assistant. Fine-tuning on the Bayesian Assistant's predictions makes the LLMs more Bayesian, with the Bayesian versions of each LLM achieving the highest agreement with the Bayesian Assistant. Error bars (too small to be visible in plot) show standard errors across three random seeds (and three training runs).
  • Figure 5: Bayesian teaching generalizes outside the task used for fine-tuning. (a) Final-round accuracy gain in fine-tuned models compared to the original LLM when varying task complexity (here the number of features is a proxy for task complexity). (b) Final-round accuracy for LLMs on the hotel recommendation task, which was not seen during fine-tuning. We show the normative Bayesian Assistant's performance with brown dashed lines. (c) Final-round accuracy for LLMs on the web shopping domain, also unseen during fine-tuning. The green dashed line indicates the performance of the LLM when it is fine-tuned directly on web shopping data, such that no domain generalization is necessary. Error bars indicate the standard errors over three training runs (for web shopping) and additionally three random seeds (for flight recommendation and hotel recommendation).
  • ...and 22 more figures
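
The interaction loop described in the Figure 1 caption (recommend one of three flights, observe the user's actual choice, update beliefs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's Bayesian Assistant: the linear reward function, the discretized hypothesis grid, the noise level `eps`, and every function name below are hypothetical choices made for this example.

```python
import itertools
import numpy as np

def make_hypotheses(num_features, levels=(-1.0, 0.0, 1.0)):
    """Enumerate candidate preference vectors (hypothetical discretization)."""
    return np.array(list(itertools.product(levels, repeat=num_features)))

def choice_probs(hypotheses, options, eps=1e-3):
    """P(user picks each option | hypothesis), shape (num_hypotheses, num_options).

    Assumes the user picks a highest-reward option (ties split evenly),
    mixed with a small uniform noise term so no likelihood is exactly zero.
    """
    rewards = hypotheses @ options.T
    best = np.isclose(rewards, rewards.max(axis=1, keepdims=True))
    probs = best / best.sum(axis=1, keepdims=True)
    return (1.0 - eps) * probs + eps / options.shape[0]

def update(posterior, hypotheses, options, chosen):
    """One step of Bayes' rule: reweight by the likelihood, then normalize."""
    unnorm = posterior * choice_probs(hypotheses, options)[:, chosen]
    return unnorm / unnorm.sum()

def recommend(posterior, hypotheses, options):
    """Recommend the option with the highest posterior predictive probability."""
    return int(np.argmax(posterior @ choice_probs(hypotheses, options)))

def simulate(seed=0, rounds=5, num_options=3):
    """Run a few rounds against a simulated user with fixed, hidden preferences."""
    rng = np.random.default_rng(seed)
    true_w = np.array([1.0, -1.0, 0.0, 1.0])   # hypothetical user preferences
    hypotheses = make_hypotheses(len(true_w))
    true_idx = int(np.flatnonzero((hypotheses == true_w).all(axis=1))[0])
    posterior = np.full(len(hypotheses), 1.0 / len(hypotheses))  # uniform prior
    hits = []
    for _ in range(rounds):
        options = rng.random((num_options, len(true_w)))  # random flight features
        guess = recommend(posterior, hypotheses, options)
        chosen = int(np.argmax(options @ true_w))         # the user's actual choice
        hits.append(guess == chosen)
        posterior = update(posterior, hypotheses, options, chosen)
    return posterior, true_idx, hits

if __name__ == "__main__":
    posterior, true_idx, hits = simulate()
    print("correct recommendations per round:", hits)
    print("posterior mass on true preferences:", posterior[true_idx])
```

Because the true preference vector is in the hypothesis grid and always assigns the highest likelihood to the observed choice, its posterior mass cannot decrease from round to round in this sketch.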