Understanding Internal Representations of Recommendation Models with Sparse Autoencoders

Jiayin Wang, Xiaoyu Zhang, Weizhi Ma, Zhiqiang Guo, Min Zhang

TL;DR

This work addresses the need for generalizable model-level interpretability in recommender systems by introducing RecSAE, a probing framework that learns large, sparse latent representations of internal model states to uncover mono-semantic concepts. Using a probing site before the prediction layer, RecSAE reconstructs internal representations with a Top-$K$ activated latent space and auxiliary dead-latent loss, then constructs concepts via TF-IDF and an open-source LLM, validated through confidence scores and human evaluation. Across general, graph-based, and sequential models on four public datasets, RecSAE demonstrates improved latent coherence, richer interpretable concepts, and stable reconstruction performance, enabling targeted tuning of model behavior without altering the original models. The framework shows potential for debiasing and customizing recommendations, with thorough analysis of activation-level effects and practical pathways for broader adoption in industry-scale systems.
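As a rough illustration of the Top-$K$ activated latent space mentioned above, the following is a minimal sketch, not the authors' implementation. The function names (`topk_encode`, `decode`), the tied random weights, and the toy dimensions are all assumptions made for this example; the auxiliary dead-latent loss and any training loop are omitted.

```python
import random

def topk_encode(x, W_enc, b_enc, k):
    """Project x into a wide latent space and keep only the k largest pre-activations."""
    # W_enc is a list of n_latents rows, each of length len(x) (hypothetical layout).
    pre = [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(W_enc, b_enc)]
    keep = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    z = [0.0] * len(pre)
    for j in keep:
        z[j] = max(pre[j], 0.0)  # ReLU on the surviving latents
    return z

def decode(z, W_dec, b_dec):
    """Reconstruct the internal representation from the sparse latent vector."""
    out = list(b_dec)
    for j, zj in enumerate(z):
        if zj != 0.0:
            for i, w in enumerate(W_dec[j]):
                out[i] += zj * w
    return out

# Toy setup: a 4-dimensional "internal representation" and 16 latents, 3 active.
random.seed(0)
d, n_latents, k = 4, 16, 3
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_latents)]
W_dec = [row[:] for row in W_enc]  # tied weights, purely for illustration
b_enc = [0.0] * n_latents
b_dec = [0.0] * d

x = [0.5, -1.0, 0.25, 2.0]
z = topk_encode(x, W_enc, b_enc, k)
x_hat = decode(z, W_dec, b_dec)
active = [j for j, v in enumerate(z) if v != 0.0]
```

Here `active` lists the latents that fire for the input; in the probing setting, `x` would be the representation taken just before the prediction layer.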

Abstract

Recommendation model interpretation aims to reveal the relationships between inputs, internal model representations, and outputs in order to enhance the transparency, interpretability, and trustworthiness of recommendation systems. However, the inherent complexity and opacity of deep learning models pose challenges for model-level interpretation. Moreover, most existing methods for interpreting recommendation models are tailored to specific architectures or model types, limiting their generalizability across different types of recommenders. In this paper, we propose RecSAE, a generalizable probing framework that interprets Recommendation models with Sparse AutoEncoders. The framework extracts interpretable latents from the internal representations of recommendation models and links them to semantic concepts for interpretation. It does not alter the original models during interpretation, and it also enables targeted tuning of those models. Experiments on three types of recommendation models (general, graph-based, sequential) with four widely used public datasets demonstrate the effectiveness and generalization of the RecSAE framework. The interpreted concepts are further validated by human experts, showing strong alignment with human perception. Overall, RecSAE is a novel step toward model-level interpretation of various types of recommendation models: it leaves their function unaffected and offers potential for targeted tuning.
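The step of linking latents to semantic concepts is described in the TL;DR as using TF-IDF followed by an LLM. The sketch below shows only a plain TF-IDF ranking, under the assumption that each latent is treated as one "document" made of tokens from the items that most activate it. The function name `tfidf_top_terms`, the toy token lists, and this particular TF-IDF variant are illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter

def tfidf_top_terms(latent_docs, latent_id, top_n=3):
    """Rank terms for one latent by TF-IDF, treating each latent as a document.

    latent_docs maps a latent id to the tokens of the items that most
    activate it (a hypothetical input format for this sketch).
    """
    n_docs = len(latent_docs)
    # Document frequency: in how many latents does each term appear?
    df = Counter()
    for tokens in latent_docs.values():
        df.update(set(tokens))
    tf = Counter(latent_docs[latent_id])
    total = sum(tf.values())
    scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy data: tokens from item metadata grouped by the latent they activate.
docs = {
    0: "horror movie zombie horror movie".split(),
    1: "romance movie comedy movie".split(),
    2: "cooking book recipe book".split(),
}
top = tfidf_top_terms(docs, 0, top_n=2)
print(top)  # ['horror', 'zombie']
```

Terms shared across many latents (here "movie") are down-weighted, so the surviving terms are candidate descriptors that could then be passed to an LLM to name the concept.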

Paper Structure

This paper contains 40 sections, 5 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: RecSAE Framework. It encodes the inner representations of recommendation models into a large and sparse latent space and constructs corresponding concepts with confidence scores based on the relationship between recommendation input, latent activations, and recommendation outputs. Example recommendation models are LightGCN (He et al., 2020) and SASRec (Kang and McAuley, 2018).
  • Figure 2: Confidence Score Distribution across Interpreted Latents on BPRMF, LightGCN and SASRec on Amazon dataset.
  • Figure 3: RecSAE Latent Activation Distribution of SASRec. Cases are categorized as negative (blue) or positive (orange) based on the presence of the corresponding concepts. Results show that positive cases exhibit higher activations in the corresponding latents.
  • Figure 4: Concept Survey on BPRMF (left) and SASRec (right) trained on the four datasets. RecSAE with LLM interpretation yields more diverse concepts for models with stronger recommendation performance.
  • Figure 5: Concept Quality across Latent Activation Levels on the MovieLens Dataset, BPRMF (left) and SASRec (right). When the latent is activated, the corresponding concept hit ratio in the top-1 recommended item (blue line) is substantially higher than the global average (red line).
  • ...and 4 more figures