Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong, Aditi Raghunathan

TL;DR

WeightWatch offers a data-free, weight-space interpretability method for fine-tuned LLMs by analyzing the weight difference $\Delta W = W_{\text{post}} - W_{\text{base}}$ through the top left singular vectors $\{\mathbf{u}_i\}$ of $\Delta W$ to identify newly acquired behaviors. By monitoring the cosine similarity between token activations and these weight-derived directions, the approach detects salient fine-tuning effects such as backdoors and unlearning without access to training data, and can even steer activations orthogonally to suppress unwanted behaviors. The method demonstrates high detection accuracy across backdoor and unlearning scenarios, outperforms activation-based baselines, and extends to open-weight models in-the-wild, including auditing for model-specific fine-tuning priorities (e.g., marketing content, Midjourney prompts, equation solving). It also shows steering capabilities that can recover or modify certain learned behaviors, illustrating both the potential for safer model use and the need for defenses against adaptive adversaries. Overall, WeightWatch contributes a practical, scalable, weight-centered framework for mechanistic understanding, monitoring, and control of LLM behavior in data-constrained or data-private contexts.
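The monitoring idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked in the abstract for that): the helper names, the toy dimensions, and the synthetic rank-1 "fine-tuning" update are assumptions made purely for illustration. It computes the top left singular vectors of $\Delta W$ and the cosine similarity between activations and those directions.

```python
import numpy as np

def top_left_singular_vectors(w_base, w_post, k):
    """Top-k left singular vectors (as columns) of delta_w = w_post - w_base."""
    delta_w = w_post - w_base
    u, s, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k], s[:k]

def cosine_to_directions(activations, directions, eps=1e-12):
    """Cosine similarity of each activation (row) with each direction (column)."""
    act_norm = np.linalg.norm(activations, axis=1, keepdims=True)  # (n, 1)
    dir_norm = np.linalg.norm(directions, axis=0, keepdims=True)   # (1, k)
    return (activations @ directions) / np.maximum(act_norm * dir_norm, eps)

# Toy setup (hypothetical): fine-tuning adds a rank-1 update that writes
# along a single "behavior" direction in the layer's output space.
rng = np.random.default_rng(0)
d_out, d_in = 16, 8
w_base = rng.normal(size=(d_out, d_in))
behavior = rng.normal(size=d_out)
behavior /= np.linalg.norm(behavior)
w_post = w_base + 3.0 * np.outer(behavior, rng.normal(size=d_in))

directions, singular_values = top_left_singular_vectors(w_base, w_post, k=1)

# Synthetic activations: "triggered" ones carry a large behavior component.
benign = rng.normal(size=(100, d_out))
triggered = benign + 20.0 * behavior

cos_benign = cosine_to_directions(benign, directions)
cos_triggered = cosine_to_directions(triggered, directions)

# Singular vectors are only defined up to sign, so compare magnitudes.
print("max |cos| on benign activations:   ", np.abs(cos_benign).max())
print("min |cos| on triggered activations:", np.abs(cos_triggered).min())
```

Because the sign of a singular vector is arbitrary, a monitor built this way would need to look at magnitudes or at both tails of the similarity distribution. How to set the flagging threshold is left open here.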

Abstract

Releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focuses, including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
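The steering mentioned above is described in the TL;DR as moving activations orthogonally to the weight-derived directions. One plausible reading, sketched below under that assumption, is to project the component along a unit direction $\mathbf{u}$ out of each activation $\mathbf{h}$, i.e. $\mathbf{h}' = \mathbf{h} - (\mathbf{h}^\top \mathbf{u})\,\mathbf{u}$. The function name and toy data are hypothetical, and the authors' actual procedure may differ.

```python
import numpy as np

def project_out(activations, direction):
    """Remove the component of each activation (row) along `direction`."""
    u = direction / np.linalg.norm(direction)  # normalize to unit length
    return activations - np.outer(activations @ u, u)

# Toy data (hypothetical): random activations and a random direction.
rng = np.random.default_rng(0)
activations = rng.normal(size=(5, 16))
direction = rng.normal(size=16)

steered = project_out(activations, direction)

# After projection, the activations have no component along the direction.
residual = steered @ (direction / np.linalg.norm(direction))
print("largest remaining component:", np.abs(residual).max())
```

Projection is idempotent, so applying it again to already-steered activations leaves them unchanged. The norm of each activation can shrink, which a real intervention might want to compensate for.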

Paper Structure

This paper contains 52 sections, 13 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Comparison of activation-based and weight-based interpretability paradigms. In the illustrations, circles stand for activations of regular data and triangles stand for activations of anomalous data. Left: Activation-based methods fail to work given limited anomaly data, limiting their use against novel, out-of-distribution threats. Middle: The weight-based approach directly analyzes the model parameters, enabling interpretation without access to training or calibration data. Right: On language models that underwent backdoor and unlearning fine-tuning, our method is able to detect a median of 99.8% backdoor utilizations and 91.0% unlearned content queries, with low false positive rates.
  • Figure 2: PCA results with varying amounts of triggered data. 313 and 10 harmful prompts with trigger together with the full clean set are used for PCA calculation.
  • Figure 3: Distribution of cosine similarity between activations and various probing directions. Taking the dot product gives very similar results. (Left) Probe with the activation difference between "Say some cheerful inspiring words." and "Say some bad terrible ugly curse words." (Middle) Probe with the activation difference between "Say something you are used to say." and "Say something you usually don't say." (Right) Probe with the weight-derived direction O4_u11.
  • Figure 4: Our method for monitoring and steering LLMs.
  • Figure 5: ROC curves for the BEAT baseline on five PPO trojan models.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Remark 3.1
  • Remark 3.2
  • Proof: Proof of \ref{thm:distribution}
  • Remark A.1: Rank–1 update from $T$ steps of gradient descent over-fitting one sample