Recovering Mental Representations from Large Language Models with Markov Chain Monte Carlo
Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths
TL;DR
This work investigates whether adaptive sampling methods can extract human-like mental representations from large language models (LLMs) by treating the model as a participant in a behavioral inference task. By framing color representations as $p(x|c)$ over the HSL color space and applying Direct Prompting, Direct Sampling, MCMC, and Gibbs Sampling with GPT-4, the authors compare the recovered distributions to human data. Adaptive methods (MCMC and Gibbs) outperform static prompting, with MCMC achieving the closest alignment to human representations in both distribution and mode, suggesting that LLMs can participate in Bayesian-like inference when integrated into sampling procedures. The findings point to a general and efficient framework for eliciting and comparing internal representations in LLMs, with potential applications in Bayesian reasoning and AI interpretability beyond color domains.
Abstract
Simulating sampling algorithms with people has proven a useful method for efficiently probing and understanding their mental representations. We propose that the same methods can be used to study the representations of Large Language Models (LLMs). While one can always directly prompt either humans or LLMs to disclose their mental representations introspectively, we show that increased efficiency can be achieved by using LLMs as elements of a sampling algorithm. We explore the extent to which we recover human-like representations when LLMs are interrogated with Direct Sampling and Markov chain Monte Carlo (MCMC). We find a significant increase in efficiency and performance using adaptive sampling algorithms based on MCMC. We also highlight the potential of our approach to serve as a more general method of conducting Bayesian inference \textit{with} LLMs.
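To make the idea of using a respondent as an element of a sampling algorithm concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one common construction from the MCMC-with-people literature: the respondent is shown the current state and a proposed state and picks one, and if its choices follow a ratio (Luce) rule, the choice acts as a Barker acceptance step whose stationary distribution is $p(x|c)$. The names `simulated_chooser`, `run_chain`, and `circular_mean_deg`, the one-dimensional hue space, and the Gaussian-shaped target are all hypothetical; in practice the chooser would wrap a prompt to an LLM.

```python
import math
import random


def simulated_chooser(current, proposal, rng, mean=0.0, sd=20.0):
    """Hypothetical stand-in for an LLM asked which of two hues better fits a category.

    Assumption: the respondent picks the proposal with probability
    p(proposal) / (p(current) + p(proposal)), i.e. a Luce choice rule.
    """
    def density(hue):
        # Signed circular distance (degrees) from the assumed category center.
        d = (hue - mean + 180.0) % 360.0 - 180.0
        return math.exp(-0.5 * (d / sd) ** 2)

    p_cur, p_new = density(current), density(proposal)
    p_accept = p_new / (p_cur + p_new)  # Barker acceptance probability
    return proposal if rng.random() < p_accept else current


def run_chain(chooser, n_steps=5000, step_sd=15.0, seed=0):
    """Run one chain over hue, using the chooser as the accept/reject step."""
    rng = random.Random(seed)
    state = rng.uniform(0.0, 360.0)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian random-walk proposal, wrapped onto the hue circle.
        proposal = (state + rng.gauss(0.0, step_sd)) % 360.0
        state = chooser(state, proposal, rng)
        samples.append(state)
    return samples


def circular_mean_deg(angles):
    """Mean direction of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0


samples = run_chain(simulated_chooser)
est = circular_mean_deg(samples[1000:])  # discard early samples as burn-in
```

Because each proposal depends on the current state, the chain concentrates its queries where the respondent's density is high, which is the sense in which such a procedure is adaptive. Swapping `simulated_chooser` for a function that queries a model leaves `run_chain` unchanged.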
