Understanding Internal Representations of Recommendation Models with Sparse Autoencoders
Jiayin Wang, Xiaoyu Zhang, Weizhi Ma, Zhiqiang Guo, Min Zhang
TL;DR
This work addresses the need for generalizable model-level interpretability in recommender systems by introducing RecSAE, a probing framework that learns large, sparse latent representations of internal model states to uncover mono-semantic concepts. Using a probing site before the prediction layer, RecSAE reconstructs internal representations with a Top-$K$ activated latent space and auxiliary dead-latent loss, then constructs concepts via TF-IDF and an open-source LLM, validated through confidence scores and human evaluation. Across general, graph-based, and sequential models on four public datasets, RecSAE demonstrates improved latent coherence, richer interpretable concepts, and stable reconstruction performance, enabling targeted tuning of model behavior without altering the original models. The framework shows potential for debiasing and customizing recommendations, with thorough analysis of activation-level effects and practical pathways for broader adoption in industry-scale systems.
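As a rough illustration of the Top-$K$ reconstruction step described above, the following is a minimal NumPy sketch and not the authors' implementation. The names (`topk_sae_forward`, `W_enc`, `W_dec`, `b_enc`, `b_dec`), the subtraction of the decoder bias before encoding, the ReLU on the kept values, and all dimensions are assumptions chosen for the example. The auxiliary dead-latent loss is omitted.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x (batch, d) into sparse latents and decode back.

    Only the k largest pre-activations per row are kept; all other
    latents are set to zero. (Hypothetical helper, for illustration.)
    """
    pre = (x - b_dec) @ W_enc + b_enc                 # (batch, n_latents)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]    # indices of top-k per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)    # assumed ReLU on kept values
    x_hat = z @ W_dec + b_dec                         # reconstruction
    return z, x_hat

def reconstruction_loss(x, x_hat):
    """Mean squared reconstruction error over the batch."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

# Toy data standing in for internal representations taken at the probing site.
rng = np.random.default_rng(0)
d, n_latents, k = 16, 64, 4
x = rng.normal(size=(8, d))
W_enc = rng.normal(scale=0.1, size=(d, n_latents))
W_dec = W_enc.T.copy()                                # tied initialisation (assumption)
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
loss = reconstruction_loss(x, x_hat)
```

In this sketch the sparse autoencoder only reads the probed representation `x`, so the recommendation model that produced it is left unchanged.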
Abstract
Recommendation model interpretation aims to reveal the relationships between inputs, internal model representations, and outputs, in order to enhance the transparency, interpretability, and trustworthiness of recommender systems. However, the inherent complexity and opacity of deep learning models make model-level interpretation challenging. Moreover, most existing methods for interpreting recommendation models are tailored to specific architectures or model types, which limits their generalizability across different types of recommenders. In this paper, we propose RecSAE, a generalizable probing framework that interprets Recommendation models with Sparse AutoEncoders. The framework extracts interpretable latents from the internal representations of recommendation models and links them to semantic concepts for interpretation. It does not alter the original models during interpretation, and it also enables targeted tuning of the models. Experiments on three types of recommendation models (general, graph-based, and sequential) with four widely used public datasets demonstrate the effectiveness and generalizability of the RecSAE framework. The interpreted concepts are further validated by human experts and show strong alignment with human perception. Overall, RecSAE is a novel step toward model-level interpretation of various types of recommendation models that leaves their functions unaffected, and it offers potential for targeted tuning of models.
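One way to picture how a latent might be linked to a semantic concept is to score the words describing its most strongly activating items with TF-IDF, which the TL;DR names as part of concept construction. The sketch below is a simplified stand-in under stated assumptions, not the paper's pipeline: the function `tfidf_keywords`, the whitespace tokenizer, the smoothed IDF formula, and the toy item titles are all hypothetical, and the LLM step that turns keywords into a concept description is not shown.

```python
import math
from collections import Counter

def tfidf_keywords(activated_titles, all_titles, top_n=3):
    """Rank tokens that are frequent among a latent's top-activating items
    but rare across the whole catalogue. (Hypothetical helper.)"""
    def tokenize(text):
        return text.lower().split()

    # Document frequency: in how many catalogue titles each token appears.
    n_docs = len(all_titles)
    df = Counter()
    for title in all_titles:
        df.update(set(tokenize(title)))

    # Term frequency over the items that activate this latent.
    tf = Counter()
    for title in activated_titles:
        tf.update(tokenize(title))
    total = sum(tf.values())

    # Smoothed IDF avoids division by zero for unseen tokens.
    scores = {
        tok: (count / total) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
        for tok, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy catalogue and the (assumed) top-activating items of one latent.
catalogue = [
    "space war saga",
    "space station drama",
    "romantic comedy",
    "courtroom drama",
    "space opera",
]
activated = ["space war saga", "space opera", "space station drama"]

keywords = tfidf_keywords(activated, catalogue)
```

In this toy example the highest-scoring token is `space`, which a human or an LLM could then summarize as the concept associated with the latent.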
