Extracting Interaction-Aware Monosemantic Concepts in Recommender Systems

Dor Arviv, Yehonatan Elisha, Oren Barkan, Noam Koenigstein

TL;DR

This work tackles the opacity of latent embeddings in two-tower recommender systems by extracting monosemantic neurons using a Sparse Autoencoder (SAE) trained with a prediction-aware objective that backpropagates through a frozen recommender. The method preserves user–item interaction semantics via a reconstruction loss that combines embedding fidelity and alignment of predicted affinities, supplemented by KL-based sparsity to encourage compact, disentangled representations. Across MF and NCF on MovieLens ML1M and Last.FM, the approach yields neurons that align with genres, popularity, and temporal trends, enabling post hoc interventions such as targeted promotion without retraining. The results demonstrate practical benefits for interpretability, controllability, and content governance, while revealing hierarchical structure through Matryoshka SAEs and preserving recommendation fidelity with balanced sparsity.

Abstract

We present a method for extracting monosemantic neurons, defined as latent dimensions that align with coherent and interpretable concepts, from user and item embeddings in recommender systems. Our approach employs a Sparse Autoencoder (SAE) to reveal semantic structure within pretrained representations. In contrast to work on language models, monosemanticity in recommendation must preserve the interactions between separate user and item embeddings. To achieve this, we introduce a prediction-aware training objective that backpropagates through a frozen recommender and aligns the learned latent structure with the model's user-item affinity predictions. The resulting neurons capture properties such as genre, popularity, and temporal trends, and support post hoc control operations including targeted filtering and content promotion without modifying the base model. Our method generalizes across different recommendation models and datasets, providing a practical tool for interpretable and controllable personalization. Code and evaluation resources are available at https://github.com/DeltaLabTLV/Monosemanticity4Rec.
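
The prediction-aware objective described above can be illustrated by computing its value for a single user-item pair. This is a minimal sketch, not the authors' implementation, and it rests on several assumptions that do not come from the text: one SAE shared by users and items, sigmoid activations, squared-error terms, a dot-product (MF-style) stand-in for the frozen recommender, a Bernoulli KL sparsity penalty with target rate `rho`, and hypothetical weights `beta` and `lam`. Only the forward value is computed here; in training, an autodiff framework would propagate the gradient of this loss through the frozen scorer into the SAE weights alone.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sae_forward(e, W_enc, W_dec):
    """Encode embedding e into a code z, then decode it back (no biases, for brevity)."""
    z = [sigmoid(a) for a in matvec(W_enc, e)]
    e_hat = matvec(W_dec, z)
    return z, e_hat

def kl_sparsity(mean_act, rho=0.05, eps=1e-8):
    """Sum over neurons of KL(Bernoulli(rho) || Bernoulli(mean activation))."""
    total = 0.0
    for r in mean_act:
        r = min(max(r, eps), 1.0 - eps)  # keep the logs finite
        total += rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
    return total

def frozen_score(u, v):
    """Assumed stand-in for the frozen recommender: an MF dot product."""
    return dot(u, v)

def prediction_aware_loss(u, v, W_enc, W_dec, beta=1.0, lam=0.1):
    z_u, u_hat = sae_forward(u, W_enc, W_dec)
    z_v, v_hat = sae_forward(v, W_enc, W_dec)
    # Embedding fidelity: squared reconstruction error of both embeddings.
    l_rec = (sum((a - b) ** 2 for a, b in zip(u, u_hat))
             + sum((a - b) ** 2 for a, b in zip(v, v_hat)))
    # Prediction alignment: the frozen scorer should give the same affinity
    # on the reconstructed embeddings as on the originals.
    l_pred = (frozen_score(u, v) - frozen_score(u_hat, v_hat)) ** 2
    # Sparsity: mean activation per neuron, here averaged over just two codes.
    mean_act = [(a + b) / 2 for a, b in zip(z_u, z_v)]
    l_sparse = kl_sparsity(mean_act)
    total = l_rec + beta * l_pred + lam * l_sparse
    return total, (l_rec, l_pred, l_sparse)

# Toy inputs: 2-dimensional embeddings, 3 SAE neurons (all values hypothetical).
u = [0.5, -0.2]
v = [0.1, 0.4]
W_enc = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.5]]
W_dec = [[0.2, 0.0, -0.3], [0.1, 0.4, 0.2]]
total, parts = prediction_aware_loss(u, v, W_enc, W_dec, beta=1.0, lam=0.1)
```

Setting `beta=0.0` in this sketch removes the prediction term, which mirrors the ablation mentioned in the Figure 4 caption.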

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Framework architecture. Solid black arrows indicate forward pass; dashed red arrows show gradient flow. Our innovation adapts monosemanticity to recommender systems by backpropagating a novel prediction-level loss through a frozen recommender, thus preserving user-item interaction semantics.
  • Figure 2: Representative monosemantic neurons extracted from our SAE bottleneck. Top row: Mean activation of all neurons for items from two genres (Children, Horror), revealing sharp peaks at genre-aligned units. Middle row: Mean activation of two genre-selective neurons (Sci-Fi, Comedy) across items from various genres, showing strong intra-neuron selectivity. Bottom row: Tag distribution among the top-50 activating artists for a neuron in the Last.FM dataset, highlighting alignment with electronic music (e.g., dance, house, techno). Together, these examples illustrate the emergence of interpretable, concept-specific neurons across domains.
  • Figure 3: Temporal specialization of four NCF neurons. Each plot shows the decade-wise distribution of top-activating movies, revealing sharp alignment with stylistic eras (e.g., 1990s Thrillers, 1980s Comedies, and Golden Age films).
  • Figure 4: Effect of the prediction-aware loss $\mathcal{L}_{pred}$ on recommendation fidelity and interpretability. Left and center: Rank-Biased Overlap (RBO) and Kendall-Tau correlation between the original and reconstructed top-30 recommendation lists improve with increasing weight $\beta$. Right: The monosemanticity score [pach2025sparse] peaks at intermediate values, highlighting a trade-off between fidelity and sparsity. Notably, $\beta{=}0$ corresponds to an ablation without $\mathcal{L}_{pred}$, underscoring its importance for alignment.
  • Figure 5: Targeted item promotion in Last.FM via neuron-level intervention. By increasing the activation of a genre-aligned neuron in Bob Dylan’s embedding vector $\mathbf{z}$ (x-axis), the artist becomes relevant to users who prefer Metal, Contemporary Pop, and Electronic music, appearing in their top-30 recommendations despite no prior affinity.
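
Two operations mentioned in the captions above, the neuron-level promotion of Figure 5 and the Rank-Biased Overlap of Figure 4, can be sketched on toy data. This is an illustrative sketch under stated assumptions, not the paper's code: the decoder matrix `W_dec`, the item codes, the user vector, and the dot-product scorer are all hypothetical, and `rbo_at_k` is a truncated, normalized RBO variant with an assumed persistence `p=0.9` (the variant and parameters used in the paper are not given in this section).

```python
def decode(z, W_dec):
    """Linear SAE decoder: map a sparse code z back to embedding space."""
    return [sum(w * a for w, a in zip(row, z)) for row in W_dec]

def top_k(user, item_codes, W_dec, k):
    """Rank items by dot-product score between the user and decoded item embeddings."""
    scores = {
        name: sum(u * e for u, e in zip(user, decode(z, W_dec)))
        for name, z in item_codes.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def boost_neuron(z, neuron, delta):
    """Return a copy of code z with one neuron's activation increased by delta."""
    z2 = list(z)
    z2[neuron] += delta
    return z2

def rbo_at_k(s, t, p=0.9):
    """Truncated RBO over the first k ranks, normalized so identical lists score 1."""
    k = min(len(s), len(t))
    if k == 0:
        return 0.0
    acc = 0.0
    for d in range(1, k + 1):
        overlap = len(set(s[:d]) & set(t[:d]))
        acc += p ** (d - 1) * overlap / d
    return acc * (1 - p) / (1 - p ** k)

# Hypothetical setup: 2-dimensional embeddings, 3 neurons.
# Neuron 0 decodes to the direction this user prefers.
W_dec = [[1.0, 0.0, 0.5],
         [0.0, 1.0, 0.5]]
user = [1.0, 0.0]
items = {
    "a": [0.9, 0.0, 0.0],
    "b": [0.5, 0.0, 0.2],
    "dylan": [0.0, 1.0, 0.0],  # no activation on neuron 0, so no affinity
}

before = top_k(user, items, W_dec, k=2)   # ['a', 'b']
promoted = dict(items, dylan=boost_neuron(items["dylan"], neuron=0, delta=1.0))
after = top_k(user, promoted, W_dec, k=2)  # ['dylan', 'a']
list_shift = rbo_at_k(before, after)       # lower values mean larger list changes
```

The base scorer is never modified in this sketch: only the item's code is edited and decoded again, which matches the post hoc character of the intervention described in the caption.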