Independence Tests for Language Models

Sally Zhu, Ahmed Ahmed, Rohith Kuditipudi, Percy Liang

TL;DR

This work formalizes independence testing between language-model weights. In a constrained setting it gives provable p-values via exchangeable copies and permutation-equivariant training; in an unconstrained setting it gives a robust test that aligns hidden activations and supports localization across architectures. The constrained framework uses a family of test statistics with per-layer aggregation (e.g., $\phi_{U^{(\ell)}}$, $\phi_{H^{(\ell)}}$) and Fisher’s method, while the unconstrained approach introduces $\phi_\textup{MATCH}$ to tolerate architectural changes and adversarial transformations. Empirically, the methods identify dependent pairs among 21 open-weight Llama models, while the unconstrained test shows near-uniform behavior under independence and detects shared components even after pruning or retraining individual layers. This work advances model provenance and IP protection by enabling provable or robust evidence of non-independence without requiring retraining, and provides a foundation for localized discovery of shared substructures across architectures.
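As a rough illustration of the aggregation step mentioned above, the following sketch combines per-layer p-values with Fisher's method. The function name `fisher_combine` is hypothetical, and the sketch assumes the per-layer p-values are independent under the null; it is not taken from the paper's implementation.

```python
import math


def fisher_combine(p_values):
    """Combine p-values with Fisher's method (assumes they are independent under the null)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is -2 * sum(log p); under the null it is chi-square
    # distributed with 2k degrees of freedom. We keep half of it for convenience.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    #   sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical per-layer p-values, for illustration only.
combined = fisher_combine([0.04, 0.20, 0.01, 0.08])
print(combined)
```

Several moderately small per-layer p-values can combine into a much smaller aggregate one, which is why aggregation across layers can help when no single layer is decisive.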

Abstract

We consider the following problem: given the weights of two models, can we test whether they were trained independently -- i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, can change model architecture, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture. The test can also do localized testing: identifying specific non-independent components of models. Though we no longer obtain exact p-values from this, empirically we find it behaves as one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
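The constrained test described above can be sketched in miniature as a permutation test. The example below is a toy under stated assumptions: each "model" is just a list of hidden-unit weight rows, the exchangeable copies are random relabelings of the first model's hidden units, and `similarity` is a hypothetical stand-in for the paper's weight and activation statistics. The noise-perturbed copy stands in for a fine-tuned model.

```python
import random


def similarity(rows_a, rows_b):
    """Hypothetical statistic: sum of dot products between same-index hidden units."""
    return sum(sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(rows_a, rows_b))


def permutation_p_value(rows_1, rows_2, num_copies=99, seed=0):
    """Rank the observed similarity among similarities of permuted copies of model 1."""
    rng = random.Random(seed)
    observed = similarity(rows_1, rows_2)
    at_least_as_large = 0
    for _ in range(num_copies):
        copy = rows_1[:]
        rng.shuffle(copy)  # random relabeling of model 1's hidden units
        if similarity(copy, rows_2) >= observed:
            at_least_as_large += 1
    # The +1 terms count the original model among its copies.
    return (1 + at_least_as_large) / (num_copies + 1)


def random_rows(rng, hidden=32, dim=16):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hidden)]


rng = random.Random(1)
base = random_rows(rng)
# Stand-in for a dependent model: the same weights plus small noise.
derived = [[w + 0.1 * rng.gauss(0.0, 1.0) for w in row] for row in base]
# Stand-in for an independent model: a fresh random draw.
independent = random_rows(rng)

p_dependent = permutation_p_value(base, derived)
p_independent = permutation_p_value(base, independent)
print(p_dependent, p_independent)
```

In this toy, the dependent pair should land at the smallest attainable value, 1/(num_copies + 1), because no relabeling of the hidden units matches the derived model as well as the original ordering does. The resolution of the p-value is set by the number of simulated copies.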

Paper Structure

This paper contains 29 sections, 5 theorems, 27 equations, 10 figures, 16 tables, 5 algorithms.

Key Result

Theorem 1

Let $\phi: \Theta \times \Theta \to \mathbb{R}$ be a test statistic and $\Pi \subset \Theta \to \Theta$ be finite. Let $A : \Theta \to \Theta$ be $\Pi$-equivariant and let $P \in \mathcal{P}(\Theta)$ be $\Pi$-invariant. For $\theta_1^0 \sim P$, let $\theta_1 = A(\theta_1^0)$. Let $\theta_2 \in \Theta$ ...

Figures (10)

  • Figure 1: Given the weights of two language models, what relationships can we derive? They could be two models trained from scratch (left). Or, one model could be derived from the other: the dependent model could be a fine-tune, a pruned model, or a partially pruned model (right). We present tests to identify such relationships.
  • Figure 2: We enumerate the public Llama-7B models and delineate the sets of dependent model pairs by color.
  • Figure 3: We plot the fraction of $\phi_\textup{MATCH}$ values less than $x \in [0,1)$ for independent model pairs, both with and without aggregation via the Fisher aggregation algorithm (algorithm:fisher). Both plots roughly follow the line $y=x$, i.e., a uniform distribution on $[0,1)$ under the null, meaning $\phi_\textup{MATCH}$ empirically acts as a p-value.
  • Figure 4: We evaluate $\phi_\textup{MATCH}^{(i,j)}$, the unconstrained setting statistic, between all pairs of GLU MLPs in Transformer block $i \in \{1, 2, \dots, 32 \}$ of Llama 3.1-8B and Transformer block $j \in \{1, 2, \dots, 28 \}$ of Llama 3.2-3B. Arrows indicate if $\phi_\textup{MATCH}^{(i,j)} <$ 1e-4 and suggest which Transformer blocks of Llama 3.1-8B were kept in the pruning process to initialize Llama 3.2-3B.
  • Figure 5: We align up-projection hidden activations from the first MLPs of Llama 3.1-8B and Llama 3.2-3B by applying the algorithm labeled algorithm:speartest in the paper to $(H_\text{up}^{(\ell)}(\theta_1),H_\text{up}^{(\ell)}(\theta_2))$, and plot the activation row from Llama 3.2-3B on the x-axis against the matched activation row from Llama 3.1-8B on the y-axis. The plot suggests that the weights and activations Llama 3.2-3B retains from Llama 3.1-8B were likely selected uniformly during pruning.
  • ...and 5 more figures
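
The kind of alignment plotted in Figure 5 can be imitated on synthetic data. The sketch below is a simplification under stated assumptions: activations are random vectors, the "pruned" model keeps every other hidden unit of the larger one with small noise added, and each row of the smaller model is matched greedily to its most cosine-similar row in the larger model. The greedy rule and the helper names are hypothetical stand-ins; this section does not specify the paper's matching procedure, and a greedy match is not guaranteed to be one-to-one.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two activation rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_rows(small_acts, large_acts):
    """For each row of the smaller model, return the index of the most similar larger-model row."""
    return [
        max(range(len(large_acts)), key=lambda j: cosine(row, large_acts[j]))
        for row in small_acts
    ]


rng = random.Random(0)
# Synthetic activations: 8 hidden units evaluated on 64 inputs.
large = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]
# Hypothetical pruning pattern: keep evenly spaced hidden units, then perturb slightly.
kept = [0, 2, 4, 6]
small = [[a + 0.05 * rng.gauss(0.0, 1.0) for a in large[i]] for i in kept]

matched = match_rows(small, large)
for x, y in enumerate(matched):
    print(f"small row {x} -> large row {y}")
```

Plotting the small-model row index against its matched large-model row index gives the kind of picture the caption describes: if the retained units were chosen at regular intervals, the points fall along a straight line.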

Theorems & Definitions (16)

  • Definition 1
  • Definition 2: $\Pi$-invariance
  • Definition 3: $\Pi$-equivariance
  • Theorem 1
  • proof
  • Definition 4
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • ...and 6 more