Bayesian Concept Bottleneck Models with LLM Priors
Jean Feng, Avni Kothari, Luke Zier, Chandan Singh, Yan Shuo Tan
TL;DR
This work introduces BC-LLM, a Bayesian concept bottleneck framework that leverages LLM priors to iteratively discover interpretable concepts for predictive tasks. By wrapping LLMs in a split-sample Metropolis-within-Gibbs sampler, the method balances efficient concept proposals with principled posterior inference, enabling uncertainty quantification even when the concept space is effectively infinite. Empirical results across image, text, and tabular domains show that BC-LLM often outperforms fully interpretable baselines and can match or exceed some black-box models, while providing interpretable, actionable concepts and robustness to out-of-distribution (OOD) samples. The approach offers a path toward safer, auditable AI in high-stakes settings such as healthcare and biology, with clear avenues for future improvement in speed and scalability.
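The Metropolis-within-Gibbs idea can be illustrated with a toy sampler. It keeps K concept slots, proposes a replacement for one slot at a time on each sweep, and accepts the proposal with the usual Metropolis-Hastings probability. This is a minimal sketch under stated assumptions, not the authors' implementation. A fixed five-concept pool with hand-set weights replaces the LLM prior and proposal. Concept values are simulated rather than LLM-extracted. The split-sample step is omitted. A Beta-Bernoulli lookup-table model stands in for the transparent prediction model. All names (`CANDIDATES`, `PRIOR`, `PROPOSAL`, `log_marginal_likelihood`, `metropolis_within_gibbs`) are hypothetical.

```python
import math
import random

# Hypothetical candidate pool; in BC-LLM the LLM proposes concepts from an open-ended space.
CANDIDATES = ["smoker", "age>60", "wears glasses", "owns a cat", "left-handed"]
# Hand-set prior weights standing in for an LLM prior (assumption of this sketch).
PRIOR = {"smoker": 0.3, "age>60": 0.3, "wears glasses": 0.2,
         "owns a cat": 0.1, "left-handed": 0.1}
# Independence proposal; set equal to the prior here (assumption).
PROPOSAL = dict(PRIOR)


def make_data(n, rng):
    """Toy data with concept values already 'extracted'; the label depends on two concepts."""
    data = []
    for _ in range(n):
        x = {c: rng.random() < 0.5 for c in CANDIDATES}
        p = 0.9 if (x["smoker"] and x["age>60"]) else 0.1
        data.append((x, int(rng.random() < p)))
    return data


def log_marginal_likelihood(concepts, data):
    """Beta(1,1)-Bernoulli lookup table: one cell per pattern of the selected concepts."""
    counts = {}
    for x, y in data:
        key = tuple(x[c] for c in concepts)
        n0, n1 = counts.get(key, (0, 0))
        counts[key] = (n0 + (y == 0), n1 + (y == 1))
    # Integrating out each cell's Bernoulli rate gives log B(n1 + 1, n0 + 1).
    return sum(math.lgamma(n1 + 1) + math.lgamma(n0 + 1) - math.lgamma(n0 + n1 + 2)
               for n0, n1 in counts.values())


def metropolis_within_gibbs(data, k, sweeps, rng):
    state = [rng.choice(CANDIDATES) for _ in range(k)]
    log_lik = log_marginal_likelihood(state, data)
    samples = []
    for _ in range(sweeps):
        for slot in range(k):  # Gibbs-style sweep: update one concept slot at a time
            old = state[slot]
            new = rng.choices(CANDIDATES, weights=[PROPOSAL[c] for c in CANDIDATES])[0]
            candidate = state[:slot] + [new] + state[slot + 1:]
            cand_log_lik = log_marginal_likelihood(candidate, data)
            # Metropolis-Hastings ratio: likelihood x prior, corrected for the proposal.
            log_alpha = (cand_log_lik - log_lik
                         + math.log(PRIOR[new]) - math.log(PRIOR[old])
                         + math.log(PROPOSAL[old]) - math.log(PROPOSAL[new]))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                state, log_lik = candidate, cand_log_lik
        samples.append(tuple(sorted(state)))
    return samples


rng = random.Random(0)
data = make_data(400, rng)
samples = metropolis_within_gibbs(data, k=2, sweeps=200, rng=rng)
kept = samples[50:]  # discard burn-in sweeps
freq = {}
for s in kept:
    freq[s] = freq.get(s, 0) + 1
top = max(freq, key=freq.get)
print("most frequent concept set:", top, "frequency:", freq[top] / len(kept))
```

Because the sampler returns draws from a posterior over concept sets rather than a single selection, in this toy setting the frequency of each set serves as an estimate of its posterior probability. That frequency is the kind of uncertainty quantification the TL;DR refers to.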
Abstract
Concept Bottleneck Models (CBMs) have been proposed as a compromise between white-box and black-box models, aiming to achieve interpretability without sacrificing accuracy. The standard training procedure for CBMs is to predefine a candidate set of human-interpretable concepts, extract their values from the training data, and identify a sparse subset as inputs to a transparent prediction model. However, such approaches are often hampered by the tradeoff between exploring a sufficiently large set of concepts and controlling the cost of obtaining concept extractions, resulting in a large interpretability-accuracy tradeoff. This work investigates a novel approach that sidesteps these challenges: BC-LLM iteratively searches over a potentially infinite set of concepts within a Bayesian framework, in which Large Language Models (LLMs) serve as both a concept extraction mechanism and a prior. Even though LLMs can be miscalibrated and hallucinate, we prove that BC-LLM can provide rigorous statistical inference and uncertainty quantification. Across image, text, and tabular datasets, BC-LLM outperforms interpretable baselines and even black-box models in certain settings, converges more rapidly towards relevant concepts, and is more robust to out-of-distribution samples.
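The standard CBM recipe described above (predefine concepts, extract their values, select a sparse subset, fit a transparent model) can be sketched in a few lines. This is a minimal illustration under stated assumptions. Keyword matching stands in for human or LLM concept extraction. A correlation filter stands in for sparse selection. Plain logistic regression serves as the transparent model. The data, concept names, and helpers (`extract`, `select_sparse`, `fit_logistic`) are all hypothetical.

```python
import math

# Toy labelled notes (hypothetical data, for illustration only).
DATA = [
    ("patient reports chest pain and shortness of breath", 1),
    ("routine visit, no complaints", 0),
    ("chest pain radiating to left arm", 1),
    ("follow-up for seasonal allergies", 0),
    ("shortness of breath on exertion", 1),
    ("no complaints, vaccines updated", 0),
]

# Step 1: predefine candidate concepts. Keyword rules stand in for an annotator or LLM (assumption).
CONCEPTS = {
    "mentions chest pain": lambda t: "chest pain" in t,
    "mentions breathing trouble": lambda t: "shortness of breath" in t,
    "mentions allergies": lambda t: "allerg" in t,
    "mentions vaccines": lambda t: "vaccine" in t,
}


def extract(texts, concepts):
    """Step 2: extract a binary value for every (example, concept) pair."""
    return [[float(fn(t)) for fn in concepts.values()] for t in texts]


def select_sparse(X, y, k):
    """Step 3: keep the k concepts with the largest |correlation| with the label."""
    n = len(y)
    ybar = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        cbar = sum(col) / n
        cov = sum((c - cbar) * (v - ybar) for c, v in zip(col, y))
        var = sum((c - cbar) ** 2 for c in col) or 1e-12
        # The label's spread is identical for every concept, so it is dropped from the score.
        scores.append(abs(cov) / math.sqrt(var))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Step 4: transparent model, logistic regression fit by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(y)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            gb += err
            for j, xi in enumerate(row):
                gw[j] += err * xi
        b -= lr * gb / n
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
    return w, b


texts, labels = zip(*DATA)
X = extract(texts, CONCEPTS)
names = list(CONCEPTS)
keep = select_sparse(X, labels, k=2)
weights, bias = fit_logistic([[row[j] for j in keep] for row in X], labels)
for j, w in zip(keep, weights):
    print(f"{names[j]}: {w:+.2f}")
```

Every candidate concept must be extracted for every example before selection happens. That up-front extraction cost is what limits how many candidates such a pipeline can afford to explore.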
