MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Ali Nasiri-Sarvi, Anh Tien Nguyen, Hassan Rivaz, Dimitris Samaras, Mahdi S. Hosseini

TL;DR

A recent MonoScore metric is studied and a single-pass algorithm is derived that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images.

Abstract

Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent's activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at https://github.com/AtlasAnalyticsLab/MonoLoss.
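
The class purity mentioned in the abstract can be illustrated with a minimal sketch. It assumes that an image counts as "activating" when the latent's activation exceeds a threshold; the function name `class_purity` and the `threshold` parameter are hypothetical and not taken from the paper's code.

```python
from collections import Counter


def class_purity(activations, labels, threshold=0.0):
    """Fraction of a latent's activating images that share its dominant class.

    `threshold` is an assumed activation cutoff, not a value from the paper.
    """
    # Keep the class labels of images on which the latent fires.
    active = [lab for a, lab in zip(activations, labels) if a > threshold]
    if not active:
        return 0.0  # a latent that never fires has no dominant class
    dominant_count = Counter(active).most_common(1)[0][1]
    return dominant_count / len(active)


# Toy example: the latent fires on three "cat" images and one "car" image.
purity = class_purity(
    [0.9, 0.4, 0.0, 0.7, 0.2],
    ["cat", "cat", "dog", "cat", "car"],
)
print(purity)
```

A purity near 1 means almost all activating images belong to one class, which is the direction of improvement the abstract reports.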

Paper Structure (20 sections, 1 theorem, 17 equations, 41 figures, 4 tables, 2 algorithms)


Key Result

Proposition 3.1

Let $MS^{\text{pair}} \in \mathbb{R}^M$ denote the MonoScore computed by the pairwise baseline algorithm and $MS^{\text{lin}} \in \mathbb{R}^M$ the MonoScore computed by the linear-time algorithm. Then for every neuron $k \in \{1,\dots,M\}$, the two scores coincide: $MS^{\text{pair}}_k = MS^{\text{lin}}_k$.
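
A small numerical sketch can make the equivalence concrete. It rests on assumptions not spelled out in this summary: that the MonoScore of one neuron is the activation-weighted mean cosine similarity over all pairs of distinct images, that embeddings are unit-normalized, and that activations are already scaled to $[0, 1]$. Under those assumptions the double sum collapses via $\|\sum_n a_n e_n\|^2 = \sum_{n,m} a_n a_m \langle e_n, e_m \rangle$. The function names are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def monoscore_pairwise(acts, embs):
    """O(N^2): weighted mean cosine similarity over ordered pairs n != m."""
    num = den = 0.0
    for n in range(len(acts)):
        for m in range(len(acts)):
            if n != m:
                w = acts[n] * acts[m]
                num += w * dot(embs[n], embs[m])
                den += w
    return num / den


def monoscore_linear(acts, embs):
    """O(N): a single pass that accumulates three running sums."""
    weighted_sum = [0.0] * len(embs[0])  # sum_n a_n * e_n
    sum_a = sum_a2 = 0.0                 # sum_n a_n and sum_n a_n^2
    for a, e in zip(acts, embs):
        for d, x in enumerate(e):
            weighted_sum[d] += a * x
        sum_a += a
        sum_a2 += a * a
    # The squared norm covers all (n, m) pairs; the diagonal terms equal
    # a_n^2 because each embedding has unit norm, so subtract them.
    num = dot(weighted_sum, weighted_sum) - sum_a2
    den = sum_a ** 2 - sum_a2
    return num / den


# Synthetic data: random unit-norm embeddings and activations in [0, 1).
random.seed(0)
N, D = 200, 16
embs = []
for _ in range(N):
    v = [random.gauss(0.0, 1.0) for _ in range(D)]
    norm = math.sqrt(dot(v, v))
    embs.append([x / norm for x in v])
acts = [random.random() for _ in range(N)]

pair = monoscore_pairwise(acts, embs)
lin = monoscore_linear(acts, embs)
print(abs(pair - lin))  # difference should be at floating-point noise level
```

The linear version never forms the $N \times N$ similarity matrix, which is what lets the cost grow with $N$ rather than $N^2$.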

Figures (41)

  • Figure 1: From multi-semantic images, an encoder produces latent feature dimensions. Without MonoLoss, a single dimension tends to be polysemantic, activating for disparate concepts. With MonoLoss, dimensions consistently respond to one concept, yielding coherent and interpretable monosemantic features.
  • Figure 2: Wall-clock time to compute MonoScore as a function of dataset size $N$ (log-log scale), benchmarked on an NVIDIA H100 GPU. The pairwise baseline implementation clearly exhibits quadratic $O(N^2)$ scaling. In contrast, our linear-time formulation scales as $O(N)$. Green annotations highlight the speedup factor, which reaches 1234$\times$ at $N=2^{16}$, the largest point evaluated for the prohibitively slow baseline.
  • Figure 3: Comparison of activation patterns with and without MonoLoss for a BatchTopK autoencoder on CLIP-image features. Both settings show latent 6453, ranked 426 without MonoLoss and 592 with MonoLoss by validation-set monosemanticity. Each row displays the top-5, middle-5, and bottom-5 activated samples of this latent, ordered by activation strength from high to low. Without MonoLoss, the dimension appears monosemantic with respect to polar-bear/ice-related concepts only for the strongest activations, while weaker ones include unrelated concepts like sports and food. With MonoLoss, the same latent mostly focuses on polar bears and bears across top, middle, and bottom activations, indicating better monosemanticity over the full range of activation strengths.
  • Figure 4: $R^2$ and MonoScore for SAEs trained with and without MonoLoss across four architectures (BatchTopK, TopK, JumpReLU, Vanilla ReLU), three vision encoders (CLIP, SigLIP2, ViT Supervised), and two datasets (Open Images V7 test set, ImageNet-1K validation set). MonoLoss consistently raises monosemanticity across nearly all configurations, with BatchTopK exhibiting minimal reconstruction drops. Supervised ViT shows the most pronounced interpretability–reconstruction trade-off, but also achieves the largest absolute gains in MonoScore. Notably, several encoder–architecture pairs on ImageNet-1K (BatchTopK–SigLIP2, TopK–CLIP, TopK–ViT) show slight $R^2$ improvements with MonoLoss.
  • Figure 5: Comparison of activation patterns with and without MonoLoss for BatchTopK autoencoders on CLIP-image features. The two rows show latents of the same rank ($392$ out of $8192$) in the no-MonoLoss and MonoLoss settings, respectively. For each latent, we show top, middle, and bottom positively activated samples, ordered by activation strength from high to low. The latent from the baseline (no MonoLoss) targets "surgeons and operating rooms" but rapidly loses coherence, firing on unrelated images like "shoes, x-rays, and soccer" in weaker activations. In contrast, the latent from the model trained with MonoLoss remains highly coherent, consistently identifying "cats" across its full dynamic range.
  • ...and 36 more figures

Theorems & Definitions (2)

  • Proposition 3.1: Equivalence of MonoScore algorithms
  • Proof: Proof of Proposition 3.1