Challenges with unsupervised LLM knowledge discovery
Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah
TL;DR
This paper critically examines unsupervised knowledge discovery in LLMs, focusing on contrast-consistent search (CCS). It provides theoretical results showing that the CCS objective can be satisfied by arbitrary features, not just knowledge, and presents experiments across multiple datasets and models in which CCS learns distracting features or simulated opinions rather than the model's own knowledge. The work shows that the results of CCS and related methods are driven by prompt design and by whichever features are most prominent, not by principled access to latent knowledge, and it proposes sanity checks for evaluating future approaches. Overall, the findings challenge the reliability of current unsupervised knowledge elicitation methods and highlight the need for robust evaluation frameworks that account for non-knowledge signals in activation space.
Abstract
We show that existing unsupervised methods applied to large language model (LLM) activations do not discover knowledge; instead, they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
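The theoretical claim can be illustrated with a small numerical sketch. It assumes the usual form of the CCS objective from Burns et al., a consistency term (p(x+) - (1 - p(x-)))^2 plus a confidence term min(p(x+), p(x-))^2, and it replaces a trained probe with an idealised one that simply reads off a binary feature. The names ccs_loss, probe_outputs, truth and distractor are hypothetical and chosen for illustration only; this is not the authors' code or experimental setup.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Mean CCS loss over contrast pairs (assumed form: consistency + confidence)."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: the two probabilities of a contrast pair should sum to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

def probe_outputs(feature):
    """Idealised probe (an assumption): p(x+) = feature, p(x-) = 1 - feature."""
    f = np.asarray(feature, dtype=float)
    return f, 1.0 - f

rng = np.random.default_rng(0)
n = 1000
truth = rng.integers(0, 2, size=n)       # stand-in for the correct answers
distractor = rng.integers(0, 2, size=n)  # arbitrary binary feature, unrelated to truth

loss_truth = ccs_loss(*probe_outputs(truth))
loss_distractor = ccs_loss(*probe_outputs(distractor))
loss_uninformative = ccs_loss(np.full(n, 0.5), np.full(n, 0.5))

print(loss_truth, loss_distractor, loss_uninformative)
```

Under these assumptions the probe that tracks the unrelated feature reaches the same minimal loss as the probe that tracks the truth, so the objective alone cannot tell them apart. Only the uninformative probe is penalised.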
