Cluster-norm for Unsupervised Probing of Knowledge

Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Yeung, Kaarel Hänni

TL;DR

The paper tackles unsupervised probing of latent knowledge in language models and the problem that distracting salient features can mislead probes. It introduces Cluster-Norm, which clusters the activations of contrast pairs and normalizes them per cluster to suppress non-knowledge features before applying CCS or CRC-TPC. On the Random Words and Explicit Opinion experiments it yields substantial gains (e.g., on Mistral-7B, CCS accuracy improves from $\approx 0.53$ to $0.77$ and CRC-TPC from $\approx 0.51$ to $0.81$), while the prompt-sensitivity experiments reveal its limits. Cluster-Norm improves unsupervised probing but does not resolve the broader challenge of distinguishing true knowledge from simulated or prompted knowledge; the paper suggests directions for robust evaluation.

Abstract

The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in a given dataset can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster normalization method to minimize the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of differentiating between knowledge in general and simulated knowledge - a major issue in the literature of latent knowledge elicitation (Christiano et al., 2021) - it significantly improves the ability of unsupervised probes to identify the intended knowledge amidst distractions.
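The idea in the abstract (cluster the contrast-pair activations, normalize within each cluster, then probe) can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption made for illustration and not the authors' implementation: the toy activations, the directions `t` (knowledge) and `s` (distractor), the tiny k-means, the choice to cluster on pair differences, and all function names are hypothetical. The probe is a CRC-TPC-style classifier that uses the sign of the projection onto the top principal component of the normalized pair differences.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Mean-center each dimension and scale it to unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def kmeans_labels(points, k=2, iters=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def burns_norm(x_pos, x_neg):
    """Normalize each side of the contrast pairs over the whole dataset."""
    return normalize(x_pos), normalize(x_neg)


def cluster_norm(x_pos, x_neg, k=2):
    """Normalize each side of the contrast pairs separately inside each cluster."""
    # Assumption of this sketch: clusters are found on the pair differences.
    labels = kmeans_labels(x_pos - x_neg, k)
    out_pos, out_neg = np.empty_like(x_pos), np.empty_like(x_neg)
    for j in np.unique(labels):
        mask = labels == j
        out_pos[mask] = normalize(x_pos[mask])
        out_neg[mask] = normalize(x_neg[mask])
    return out_pos, out_neg


def top_pc_accuracy(x_pos, x_neg, target):
    """Classify by the sign of the top principal component of the differences."""
    diffs = x_pos - x_neg
    diffs = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    preds = (diffs @ vt[0]) > 0
    acc = np.mean(preds == target)
    return max(acc, 1 - acc)  # the sign of a principal component is arbitrary


def make_toy_pairs(n=400, dim=8, seed=0):
    """Synthetic contrast pairs with a weak truth feature and a strong distractor."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    truth = idx % 2 == 0         # ground-truth label
    word = (idx // 2) % 2 == 0   # distractor label, balanced and uncorrelated with truth
    t = np.tile([1.0, -1.0], dim // 2) / np.sqrt(dim)  # knowledge direction
    s = np.ones(dim) / np.sqrt(dim)                    # distractor direction
    y = np.where(truth, 1.0, -1.0)[:, None]
    c = np.where(word, 1.0, -1.0)[:, None]
    offset = y * t + 3.0 * c * s  # the distractor is three times more salient
    x_pos = offset + 0.3 * rng.standard_normal((n, dim))
    x_neg = -offset + 0.3 * rng.standard_normal((n, dim))
    return x_pos, x_neg, truth, word


x_pos, x_neg, truth, word = make_toy_pairs()
acc_burns = top_pc_accuracy(*burns_norm(x_pos, x_neg), truth)
acc_cluster = top_pc_accuracy(*cluster_norm(x_pos, x_neg), truth)
print(f"global norm: {acc_burns:.2f}  cluster norm: {acc_cluster:.2f}")
```

In this toy setup the globally normalized probe should latch onto the distractor and score near chance against the truth labels, while the per-cluster version should recover them, because normalizing inside each cluster removes the offset that the distractor adds to each side of the pair. Real activations are unlikely to separate this cleanly, and the number of clusters would need to be chosen for the data at hand.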

Paper Structure (33 sections, 6 equations, 35 figures, 8 tables)

Introduction
Background
Contrast-Consistent Search (CCS)
Contrastive Representation Clustering
Theoretical Background
Method
Experiments
Random Words
Dataset
Training and Results
Explicit Opinion
Dataset
Training & Results
Prompt Template Sensitivity
Datasets
...and 18 more sections

Figures (35)

Figure 1: Under the standard CCS approach (left; Burns-Norm), a modified prompt including distracting random words (red) causes all CCS probes to achieve random accuracy against ground truth labels (GT), and high accuracy against these random word labels (random). When using our approach of cluster normalization (right; Cluster-Norm), the average accuracy of CCS probes for the desired feature increases significantly.
Figure 2: Mean accuracy of Logistic Regression, CRC-TPC, and CCS probes across six different models for (top) unmodified prompts and (bottom) modified prompts with distracting random words. Especially for the modified prompts, unsupervised methods using our Cluster-Normalization consistently outperform standard Burns-Normalization across the 25th, 50th, and 75th percentile layers and the final layer.
Figure 3: Visualization of the top three principal components (PCs) of the normalized contrast pair differences $\widetilde{\mathcal{M}}(x_i^+) - \widetilde{\mathcal{M}}(x_i^-)$ - with normalization performed either over the entire dataset (left) or per cluster (right) - for the random words experiment. Points are colored orange or blue based on the ground truth label (positive/negative) and shaded light or dark based on the appended random word (banana/shed). For each subfigure, we compare PCA projections using default prompts, where no random words are appended (left) against modified prompts, where random words like "banana" / "shed" are appended (right). On the left we note the first PC classifies the undesired random word feature (light vs dark). On the right, using cluster normalization, we find the first PC classifies the desired knowledge feature (orange vs blue).
Figure 4: Discovering an explicit opinion with Mistral-7B. Accuracy when using the default prompt (blue) vs a modified prompt with the opinion of fictional Alice (red), evaluated against the ground truth sentiment labels (dark) and labels of Alice (light). Under the standard CCS approach (Burns-Norm), most CCS probes trained on the modified prompt achieve only random accuracy against the ground truth labels (dark red). When using cluster normalization, we find this average accuracy increases.
Figure 5: Variation in probe accuracy when investigating prompt template sensitivity using the CommonClaim dataset, for Mistral-7B. In the default setting (blue), when compared to the literal (red) and professor (green) settings, we see a slightly more varied spread in probe accuracy, regardless of the use of cluster normalization.
...and 30 more figures
