Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Yeung, Kaarel Hänni
TL;DR
The paper addresses unsupervised probing of latent knowledge in language models, where salient but irrelevant features in a dataset can mislead probes. It introduces Cluster-Norm, which clusters the activations of contrast pairs and normalizes them per cluster to suppress such distracting features before applying CCS or CRC-TPC. On the Random Words and Explicit Opinion experiments, this yields substantial gains (e.g., on Mistral-7B, CCS accuracy improves from $\approx 0.53$ to $0.77$ and CRC-TPC accuracy from $\approx 0.51$ to $0.81$), while the prompt-sensitivity experiments reveal the method's limits. Cluster-Norm does not resolve the broader challenge of distinguishing true knowledge from simulated or prompted knowledge, and the work suggests directions for more robust evaluation.
Abstract
The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in a given dataset can mislead these probes (Farquhar et al., 2023). To address this, we propose a cluster normalization method that minimizes the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of differentiating between knowledge in general and simulated knowledge, a major issue in the literature of latent knowledge elicitation (Christiano et al., 2021), it significantly improves the ability of unsupervised probes to identify the intended knowledge amidst distractions.
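The idea of clustering and then normalizing contrast-pair activations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm or the representation that is clustered, so this sketch assumes a plain k-means on the average of each contrast pair, followed by per-cluster standardization of the positive and negative activations separately. The names `kmeans`, `cluster_norm`, `pos`, and `neg` are hypothetical, and the data is synthetic.

```python
import numpy as np


def kmeans(x, k, n_iter=50, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def cluster_norm(pos, neg, k=2, eps=1e-8, seed=0):
    """Standardize contrast-pair activations separately inside each cluster.

    pos, neg: float arrays of shape (n_pairs, hidden_dim) holding the
    activations for the two completions of each contrast pair.
    """
    # Assumption: cluster on the pair average, so that both halves of a
    # pair always fall into the same cluster.
    labels = kmeans((pos + neg) / 2.0, k, seed=seed)
    pos_n = np.empty_like(pos, dtype=float)
    neg_n = np.empty_like(neg, dtype=float)
    for j in np.unique(labels):
        idx = labels == j
        for src, dst in ((pos, pos_n), (neg, neg_n)):
            mu = src[idx].mean(axis=0)
            sd = src[idx].std(axis=0)
            dst[idx] = (src[idx] - mu) / (sd + eps)
    return pos_n, neg_n, labels


# Synthetic example: a binary "distractor" shifts half of the pairs by a
# large offset along one hidden dimension.
rng = np.random.default_rng(0)
n, d = 200, 8
distractor = rng.integers(0, 2, size=n)
offset = np.zeros(d)
offset[0] = 10.0
base = rng.normal(size=(n, d)) + distractor[:, None] * offset
pos = base + 0.1 * rng.normal(size=(n, d))
neg = base + 0.1 * rng.normal(size=(n, d))

pos_n, neg_n, labels = cluster_norm(pos, neg, k=2)
```

After this step, each cluster has zero mean in both `pos_n` and `neg_n`, so an offset shared by all members of a cluster no longer separates the data. An unsupervised probe such as CCS or CRC-TPC would then be fitted on `pos_n` and `neg_n` instead of the raw activations.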
