Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong, Aditi Raghunathan
TL;DR
WeightWatch is a data-free, weight-space interpretability method for fine-tuned LLMs. It analyzes the weight difference $\Delta W = W_{\text{post}} - W_{\text{base}}$ and uses the top left singular vectors $\{\mathbf{u}_i\}$ of $\Delta W$ to identify newly acquired behaviors. By monitoring the cosine similarity between token activations and these weight-derived directions, the approach detects salient fine-tuning effects such as backdoors and unlearning without access to training data, and can steer activations orthogonally to these directions to suppress unwanted behaviors. The method achieves high detection accuracy in backdoor and unlearning scenarios, outperforms activation-based baselines, and extends to open-weight models in the wild, where auditing reveals model-specific fine-tuning priorities (e.g., marketing content, Midjourney prompts, equation solving). Steering can also recover or modify certain learned behaviors, which illustrates both the potential for safer model use and the need for defenses against adaptive adversaries. Overall, WeightWatch contributes a practical, scalable, weight-centered framework for mechanistic understanding, monitoring, and control of LLM behavior in data-constrained or data-private settings.
Abstract
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focuses, including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
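The core computation described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names (`top_left_singular_vectors`, `cosine_scores`, `project_out`), the choice of `k`, and the convention that a weight matrix maps inputs to outputs as `y = W @ x` (so left singular vectors live in the output activation space) are all assumptions made for illustration. Which matrices are monitored and how scores are thresholded are not specified in this section and are left out.

```python
import numpy as np


def top_left_singular_vectors(w_base, w_post, k=4):
    """Top-k left singular vectors of delta_w = w_post - w_base, as columns.

    Assumes the convention y = W @ x, so each column lies in output space.
    """
    delta_w = w_post - w_base
    # Reduced SVD; NumPy returns singular values in descending order.
    u, _s, _vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]


def cosine_scores(activations, directions):
    """Cosine similarity between each activation row and each direction.

    activations: (n_tokens, d_out); directions: (d_out, k) with unit columns.
    Returns an array of shape (n_tokens, k).
    """
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    unit = activations / np.maximum(norms, 1e-12)  # guard against zero rows
    return unit @ directions


def project_out(activations, directions):
    """Remove the component of each activation along the given directions.

    Assumes the columns of `directions` are orthonormal (true for SVD output).
    """
    return activations - (activations @ directions) @ directions.T


if __name__ == "__main__":
    # Synthetic stand-in for a fine-tuning update: a rank-1 change that
    # writes along a single (hypothetical) output direction.
    rng = np.random.default_rng(0)
    w_base = rng.normal(size=(8, 6))
    direction = np.zeros(8)
    direction[2] = 1.0
    w_post = w_base + 0.5 * np.outer(direction, rng.normal(size=6))

    dirs = top_left_singular_vectors(w_base, w_post, k=1)
    acts = np.stack([3.0 * direction, rng.normal(size=8)])
    print("cosine scores:", cosine_scores(acts, dirs).ravel())
    print("after projection:", (project_out(acts, dirs) @ dirs).ravel())
```

In this toy setup the first activation is perfectly aligned with the injected direction, so its cosine score has magnitude close to 1 (the sign of a singular vector is arbitrary), while projecting out the direction leaves no component along it. In practice one would presumably flag tokens whose scores fall outside a range calibrated on ordinary inputs; that calibration step is an assumption here, not something this section defines.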
