BayesAgent: Bayesian Agentic Reasoning Under Uncertainty via Verbalized Probabilistic Graphical Modeling
Hengguan Huang, Xing Shen, Songtao Wang, Lingfa Meng, Dianbo Liu, David Alejandro Duchene, Hao Wang, Samir Bhatt
TL;DR
This work addresses the challenge of capturing latent structure and uncertainty in LLM-based agent reasoning. It introduces Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian framework that guides LLMs to discover latent variables and dependencies through prompts and to perform prompting-based Bayesian inference, with predictions computed as $E_{P(Z|X)}[P(Y|Z)]$. The Bayesian-Enhanced variant, BayesVPGM, adds a Dirichlet posterior over predictions and a differentiable calibration loss to optimize a balancing parameter $\lambda$, improving confidence calibration. Across ScienceQA, ChatCoach, and A-OKVQA, the approach yields higher accuracy and better calibration than strong baselines, demonstrating scalable, uncertainty-aware latent-variable reasoning in multi-source, open-ended tasks.
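To make the prediction rule $E_{P(Z|X)}[P(Y|Z)]$ concrete, the following is a minimal sketch for the simplest case, where both the latent variable $Z$ and the label $Y$ are discrete. The function name, the two-state latent variable, and all numbers are hypothetical illustrations, not values or code from the paper; in vPGM the probabilities would be elicited from the LLM through prompting rather than hard-coded.

```python
def marginal_prediction(posterior_z, likelihood_y_given_z):
    """Compute P(Y|X) = sum_z P(z|X) * P(Y|z) for discrete Z and Y.

    posterior_z: list of P(z|X), one entry per latent state.
    likelihood_y_given_z: list of rows, row i holding P(Y|z_i) over labels.
    """
    n_labels = len(likelihood_y_given_z[0])
    prediction = [0.0] * n_labels
    for p_z, p_y_given_z in zip(posterior_z, likelihood_y_given_z):
        for k, p_y in enumerate(p_y_given_z):
            # Weight each conditional prediction by the latent posterior.
            prediction[k] += p_z * p_y
    return prediction


# Hypothetical numbers: two latent states, two answer options.
posterior_z = [0.7, 0.3]                 # assumed P(Z|X)
likelihood = [[0.9, 0.1], [0.2, 0.8]]    # assumed P(Y|Z), one row per state
prediction = marginal_prediction(posterior_z, likelihood)
print(prediction)  # approximately [0.69, 0.31]
```

The prediction is a posterior-weighted average of the per-state predictions, so uncertainty about the latent state carries through to the final confidence instead of being collapsed to a single reasoning path.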
Abstract
Human cognition excels at transcending sensory input and forming latent representations that structure our understanding of the world. While Large Language Model (LLM) agents demonstrate emergent reasoning and decision-making abilities, they lack a principled framework for capturing latent structures and modeling uncertainty. In this work, we explore for the first time how to bridge LLM agents with probabilistic graphical models (PGMs) to address agentic reasoning under uncertainty. To this end, we introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that (i) guides LLM agents in following key principles of PGMs through natural language and (ii) refines the resulting posterior distributions via numerical Bayesian inference. Unlike many traditional probabilistic methods, which require substantial domain expertise, vPGM bypasses expert-driven model design, making it well suited to scenarios with limited assumptions. We evaluate vPGM on several agentic reasoning tasks, both closed-ended and open-ended. Our results indicate that vPGM improves confidence calibration and text generation quality.
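As a generic illustration of numerical Bayesian refinement over categorical predictions, the sketch below applies the standard Dirichlet-categorical conjugate update, the textbook setting in which a Dirichlet posterior over predictions (as mentioned in the TL;DR) arises. It is an assumption-laden example and not the paper's construction: the function name, the uniform prior, and the counts are hypothetical, and the balancing parameter $\lambda$ and calibration loss are not modeled here.

```python
def dirichlet_posterior_mean(prior_alpha, counts):
    """Posterior mean of a Dirichlet-categorical model.

    prior_alpha: Dirichlet concentration parameters, one per label.
    counts: observed label counts (for example, how often each answer
            appeared across repeated samples; a hypothetical source).
    """
    # Conjugate update: posterior concentration = prior + observed counts.
    posterior_alpha = [a + c for a, c in zip(prior_alpha, counts)]
    total = sum(posterior_alpha)
    # The mean of a Dirichlet is its normalized concentration vector.
    return [a / total for a in posterior_alpha]


# Hypothetical example: uniform prior over three labels, ten sampled answers.
posterior_mean = dirichlet_posterior_mean([1.0, 1.0, 1.0], [6, 3, 1])
print(posterior_mean)  # equals [7/13, 4/13, 2/13]
```

With few observations the prior pulls the estimate toward uniform, which tempers overconfident predictions; as counts grow, the posterior mean approaches the empirical frequencies.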
