Logit Distance Bounds Representational Similarity

Beatrix M. G. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz

TL;DR

The paper introduces a logit-distance framework to study representational similarity in a broad discriminative model class where internal representations are identifiable up to invertible linear transformations. It proves that small logit distance $d_{\text{logit}}$ implies strong linear similarity (via $m_{\text{CCA}}$) and small linear-identifiability dissimilarity $d_{\text{rep}}$, and it derives bounds showing $d_{\text{logit}}$ controls both embeddings and unembeddings through the unembedding matrices. While KL divergence can bound $d_{\text{logit}}$ under strong $\tau$-lower-boundedness, those bounds are often impractical, motivating logit-distance objectives for distillation. Empirically, distillation using $d_{\text{logit}}$-based losses (including $L_1$ variants) yields substantially more linearly similar teacher-student representations and better preservation of linearly recoverable concepts than KL-based distillation, across synthetic and real datasets. The results advocate replacing KL with logit-distance objectives in settings where preserving linear interpretable structure is important, with implications for distillation and interpretability research.

Abstract

For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models' identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher's predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.
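
The gap between KL closeness and logit closeness described above can be illustrated numerically. The sketch below is not the paper's definition of $d_{\text{logit}}$, which is not reproduced on this page; as a stand-in it assumes an $L_1$ difference between mean-centered logits, and the function names `kl_loss` and `centered_l1_logit_loss` are hypothetical. It shows that two models can disagree strongly on the logit of a very unlikely class while their KL divergence stays close to zero.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_loss(teacher_logits, student_logits):
    # Mean KL(teacher || student) over the batch.
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    return float((np.exp(lt) * (lt - ls)).sum(axis=-1).mean())

def centered_l1_logit_loss(teacher_logits, student_logits):
    # ASSUMPTION: stand-in for a logit-distance loss, not the paper's formula.
    # Logits are mean-centered per input so that adding a constant to all
    # logits (which leaves the softmax unchanged) does not affect the loss.
    ct = teacher_logits - teacher_logits.mean(axis=-1, keepdims=True)
    cs = student_logits - student_logits.mean(axis=-1, keepdims=True)
    return float(np.abs(ct - cs).mean())

# The two models differ only on the logit of a class with tiny probability.
teacher = np.array([[5.0, 0.0, -10.0]])
student = np.array([[5.0, 0.0, -20.0]])

kl = kl_loss(teacher, student)
l1 = centered_l1_logit_loss(teacher, student)
shift_invariant = centered_l1_logit_loss(teacher, teacher + 3.0)
print(f"KL = {kl:.2e}, centered L1 logit loss = {l1:.2f}")
```

Because the KL term weights each log-probability difference by the teacher probability, disagreement on near-zero-probability classes contributes almost nothing to it, whereas the logit-based loss penalizes the disagreement directly.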

Paper Structure (42 sections, 41 theorems, 221 equations, 4 figures, 4 tables)

This paper contains 42 sections, 41 theorems, 221 equations, 4 figures, 4 tables.

Key Result

Theorem 2.2

For two models $({\bm{\mathrm{f}}}, {\bm{\mathrm{g}}}), ({\bm{\mathrm{f}}}', {\bm{\mathrm{g}}}') \in \Theta$ that satisfy the diversity condition (\ref{assu:diversity-condition}), let $\tilde{y} \in \mathcal{Y}$ and $\mathcal{J} \subseteq \mathcal{Y} \setminus \{\tilde{y}\}$ be a choice of pivot point and [...] Then we have that [...] In particular, we can set ${\bm{\mathrm{A}}} = \tilde{{\bm{\mathrm{A}}}}_{\mathc

Figures (4)

  • Figure 1: In the center: the intuition behind bounding representational similarity using distributional distance. $\mathcal{P}_\Theta$ is the set of probability distributions parametrized by models in $\Theta$ (Eq. \ref{eq:model-class}) which satisfy \ref{assu:diversity-condition}. These distributions are one-to-one with identifiability classes $[({\bm{\mathrm{f}}}, {\bm{\mathrm{g}}})]$ in the quotient space $\Theta / \sim_L$ (Khemakhem et al., 2020). The colored areas in $\mathcal{P}_\Theta$ contain the distributions which are $\epsilon$-close to a reference $p_{{\bm{\mathrm{f}}}, {\bm{\mathrm{g}}}}$, as measured by $d_\mathrm{logit}$ (\ref{def:logit_dist}, blue area) or by $d_\mathrm{KL}$ (Eq. \ref{eq:d_KL}, pink area). Our \ref{th:bound_mcc} lower-bounds representational similarity (in terms of $m_{\mathrm{CCA}}$, blue arrow) using the logit distance $d_{\mathrm{logit}}$; similarly, \ref{theorem:norm_dist_bounds_direct_dist} upper-bounds dissimilarity in terms of our $d_{\mathrm{rep}}$ (\ref{def:direct_rep_distance}). In \ref{theorem:bound_kl} (pink arrow) we prove that $d_\mathrm{KL}$ yields weak bounds on $d_{\mathrm{logit}}$. We illustrate this with representations of two student models distilled from a teacher on the SUB dataset (Bader et al., 2025); see \ref{sec:exp-concepts} for details. On the left, a student model trained to minimize a variant of $d_\mathrm{logit}$ (\ref{eq:l1-logits}) to the teacher distribution $p_{{\bm{\mathrm{f}}}, {\bm{\mathrm{g}}}}$ preserves linearly encoded concepts (\ref{thm:linear-concepts-robustness}): for $6$ attributes, we visualize their linear separability in the embeddings by projecting them to two dimensions through LDA (Bishop, 2006). Distinct concept attributes can be well separated linearly in this 2D subspace. On the right, for a model trained to minimize $d_\mathrm{KL}$, the LDA reduction shows that different concept attributes are not linearly separable, as reflected by the extremely low accuracy in \ref{tab:results-sub-short}.
  • Figure 2: On the left, input data of the Synth dataset (\ref{sec:experiments}), where inputs are colored by label. The remaining plots display the embeddings of the teacher and student models. For the teacher, the embeddings of points in class 1 are nearest neighbors of those in class 6 and class 2; the same holds for the $L_1$ student, but not for the KL student. Here, the KL student has low linear similarity to the teacher ($d_\mathrm{rep} \approx 6.9$ and $m_\mathrm{CCA} \approx 0.62$), while the $L_1$ student has higher similarity ($d_\mathrm{rep} \approx 1.3$ and $m_\mathrm{CCA} \approx 0.99$).
  • Figure 3: Embeddings of $({\bm{\mathrm{f}}}, {\bm{\mathrm{g}}})$ in blue. Other colored dots mark the unembeddings, each belonging to a label.
  • Figure 4: Embeddings of $({\bm{\mathrm{f}}}', {\bm{\mathrm{g}}}')$ in blue. Other colored dots mark the unembeddings. Note that the brown and pink unembeddings are swapped compared to \ref{fig:model_1_emb_unemb}, as are the grey and yellow ones.
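
The $m_\mathrm{CCA}$ values quoted in the Figure 2 caption measure linear similarity between teacher and student embeddings. The following is a minimal sketch, assuming that $m_\mathrm{CCA}$ denotes the mean canonical correlation between two embedding matrices (the exact definition is not given on this page); the function name `mean_cca` and the synthetic data are hypothetical. It computes canonical correlations from the singular values of the product of orthonormal bases and checks that the score is close to 1 when one embedding is an invertible linear transformation of the other.

```python
import numpy as np

def mean_cca(X, Y):
    # X, Y: arrays of shape (n_samples, dim), assumed to have full column rank.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered embeddings.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))   # hypothetical "teacher" embeddings
A = rng.normal(size=(4, 4))     # random matrix, invertible with probability 1

same_up_to_linear = mean_cca(X, X @ A)                   # linearly related
unrelated = mean_cca(X, rng.normal(size=(500, 4)))       # independent noise
print(f"linear map: {same_up_to_linear:.3f}, unrelated: {unrelated:.3f}")
```

A score that is invariant under invertible linear maps fits the setting here, since the identifiability results only determine representations up to such a transformation.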

Theorems & Definitions (83)

  • Theorem 2.2: Linear Identifiability
  • Definition 3.1
  • Theorem 3.3
  • Theorem 3.4
  • Corollary 3.6
  • Definition 3.7
  • Corollary 3.8
  • Theorem 3.9
  • Corollary 3.10
  • Proposition 4.1
  • ...and 73 more