Challenges with unsupervised LLM knowledge discovery

Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

TL;DR

This paper critically examines unsupervised knowledge discovery in LLMs, focusing on contrast-consistent search (CCS). It provides theoretical results showing CCS can optimize arbitrary features, not just knowledge, and presents extensive experiments across multiple datasets and models demonstrating that CCS often learns distracting features or simulates opinions rather than genuine model knowledge. The work shows CCS and related methods yield results driven by prompt design and prominent features, not principled access to latent knowledge, and proposes sanity checks to evaluate future approaches. Overall, the findings challenge the reliability of current unsupervised knowledge elicitation methods and highlight the need for robust evaluation frameworks and consideration of non-knowledge signals in activation space.

Abstract

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al. - arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.

Paper Structure (60 sections, 5 theorems, 16 equations, 18 figures)


Key Result

Theorem 1

Let feature $h : Q \rightarrow \{0, 1\}$ be an arbitrary map from questions to binary outcomes. Let $(x_i^+, x_i^-)$ be the contrast pair corresponding to question $q_i$. Then the probe defined by $p(x_i^+) = h(q_i)$ and $p(x_i^-) = 1-h(q_i)$ achieves optimal loss, and the averaged predicti
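
The first claim of Theorem 1 can be checked numerically. The sketch below assumes the CCS objective as it is usually stated for Burns et al.: a per-pair consistency term $(p(x_i^+) - (1 - p(x_i^-)))^2$ plus a confidence term $\min(p(x_i^+), p(x_i^-))^2$, averaged over pairs. The loss form is not quoted on this page, and the function names `ccs_loss` and `arbitrary_feature_probe` are hypothetical helpers introduced only for illustration.

```python
import random


def ccs_loss(p_pos, p_neg):
    """Mean CCS loss over contrast pairs (assumed form: consistency + confidence)."""
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        consistency = (pp - (1.0 - pn)) ** 2  # zero when p(x+) = 1 - p(x-)
        confidence = min(pp, pn) ** 2         # zero when one side is exactly 0
        total += consistency + confidence
    return total / len(p_pos)


def arbitrary_feature_probe(h_values):
    """Probe outputs from Theorem 1: p(x+) = h(q), p(x-) = 1 - h(q)."""
    p_pos = [float(h) for h in h_values]
    p_neg = [1.0 - float(h) for h in h_values]
    return p_pos, p_neg


# h is an arbitrary binary labelling of questions, unrelated to any notion of truth.
rng = random.Random(0)
h_values = [rng.randint(0, 1) for _ in range(100)]

p_pos, p_neg = arbitrary_feature_probe(h_values)
loss = ccs_loss(p_pos, p_neg)
print(loss)
```

Because both terms are squares, the loss is bounded below by zero, and the probe built from any binary $h$ reaches that bound. Under the assumed loss form, the objective therefore cannot on its own distinguish a knowledge feature from any other binary feature.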

Figures (18)

  • Figure 1: Unsupervised latent knowledge detectors are distracted by other prominent features (see the section on discovering an explicit opinion). Left: We apply two transformations to a dataset of movie reviews, $q_i$. First (novel to us) we insert a distracting feature by appending either "Alice thinks it's positive" or "Alice thinks it's negative" at random to each question. Second, we convert each of these texts into contrast pairs $(x_i^+, x_i^-)$ (Burns et al., arXiv:2212.03827), appending "It is positive" or "It is negative". Middle: We then pass these contrast pairs into the LLM and extract activations, $\phi$. Right: We do unsupervised learning on the activations. We show a PCA visualisation of the activations. Without "Alice ..." inserted, we learn a classifier (taken along the $X=0$ boundary) for the review (orange/blue). However, with "Alice ..." inserted the review gets ignored and we instead learn a classifier for Alice's opinion (light/dark).
  • Figure 2: Discovering random words. Chinchilla, IMDb. (a) The methods learn to distinguish whether the prompts end with banana/shed rather than the sentiment of the review. (b) PCA visualisation of the activations, in default (left) and modified (right) settings, shows the clustering into banana/shed (light/dark) rather than review sentiment (blue/orange).
  • Figure 3: Discovering an explicit opinion. (a) When Alice's opinion is present (red) unsupervised methods accurately predict her opinion (light red) but fail to predict the sentiment of the review (dark red). Blue here shows the default prompt for comparison. (b) PCA visualisation of the activations, in default (left) and modified (right) settings, shows the clustering into Alice's opinion (light/dark) rather than review sentiment (blue/orange).
  • Figure 4: Discovering an implicit opinion for Chinchilla70B. (a) Default (blue) and modified (red) for company (dark) and non-company (light) data. The modified setting on company data (dark red) leads to a bimodal distribution for CCS with almost half of the probes (differing only in random initialisation) learning Alice's opinion. In contrast, it performs relatively well over all other categories (light red). (b) PCA: Left -- default activations show a possible separation along X-axis corresponding to topic choice (blue vs. orange) and further separation into company/non-company (light/dark). Right -- modified activations show a more pronounced company/non-company split.
  • Figure 5: Prompt sensitivity on TruthfulQA (Lin et al., 2021) for Chinchilla70B. (a) In the default setting (blue), accuracy is poor. In the literal and professor settings (red, green), accuracy improves, showing that the unsupervised methods are sensitive to irrelevant aspects of a prompt. (b) PCA of the activations coloured by ground truth, blue vs. orange, in the default (left), literal (middle) and professor (right) settings. We don't see ground-truth clusters in the default setting, but they are somewhat more visible in the literal and professor settings.
  • ...and 13 more figures
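
The two text transformations described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the paper's code: the quoted phrases come from the caption, while the exact prompt template (spacing, punctuation) and the helper names `add_distractor` and `make_contrast_pair` are assumptions.

```python
import random


def add_distractor(review, rng):
    """Append Alice's opinion, chosen at random and independent of the review."""
    opinion = rng.choice(["positive", "negative"])
    return f"{review} Alice thinks it's {opinion}.", opinion


def make_contrast_pair(text):
    """Build (x_plus, x_minus) by appending each candidate answer to the same text."""
    return f"{text} It is positive", f"{text} It is negative"


rng = random.Random(0)
reviews = ["A moving, beautifully shot film.", "Two hours I will never get back."]

dataset = []
for review in reviews:
    text, alice_opinion = add_distractor(review, rng)
    x_plus, x_minus = make_contrast_pair(text)
    dataset.append((x_plus, x_minus, alice_opinion))

for x_plus, x_minus, alice_opinion in dataset:
    print(x_plus)
    print(x_minus)
```

Because Alice's opinion is sampled independently of the review, a probe that tracks it carries no information about the review's sentiment. This is what makes it usable as a distracting feature in the experiment the caption describes.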

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Theorem 2
  • proof
  • Lemma 1
  • proof
  • Theorem 2
  • proof