Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control

Anton Klenitskiy, Konstantin Polev, Daria Denisova, Alexey Vasilev, Dmitry Simakov, Gleb Gusev

TL;DR

This work extends sparse autoencoders (SAE) to sequential recommender systems, proposes a framework for interpreting and controlling model representations, and shows that the approach works on a transformer trained on a sequential recommendation task.

Abstract

Many current state-of-the-art models for sequential recommendations are based on transformer architectures. Interpretation and explanation of such black box models is an important research question, as a better understanding of their internals can help understand, influence, and control their behavior, which is very important in a variety of real-world applications. Recently, sparse autoencoders (SAE) have been shown to be a promising unsupervised approach to extract interpretable features from neural networks. In this work, we extend SAE to sequential recommender systems and propose a framework for interpreting and controlling model representations. We show that this approach can be successfully applied to the transformer trained on a sequential recommendation task: directions learned in such an unsupervised regime turn out to be more interpretable and monosemantic than the original hidden state dimensions. Further, we demonstrate a straightforward way to effectively and flexibly control the model's behavior, giving developers and users of recommendation systems the ability to adjust their recommendations to various custom scenarios and contexts.
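
To make the idea in the abstract concrete, here is a minimal sketch of a sparse autoencoder applied to one transformer hidden state, written in plain Python. It is not the authors' implementation. The centering by the decoder bias, the ReLU encoder, and the `steer` function (which shifts the hidden state along a feature's decoder direction) are assumptions based on common SAE practice; all names and the toy sizes are hypothetical. The L1 coefficient of 0.1 matches the value quoted in the Figure 3 caption, while the dictionary size is shrunk to 8 for readability.

```python
import random

def sae_forward(h, W_enc, b_enc, W_dec, b_dec):
    """Map a hidden state h to sparse features f and a reconstruction h_hat."""
    # Assumed convention: subtract the decoder bias before encoding.
    centered = [hi - bi for hi, bi in zip(h, b_dec)]
    # f = ReLU(W_enc @ centered + b_enc); one row of W_enc per dictionary feature.
    f = [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
         for row, b in zip(W_enc, b_enc)]
    # h_hat = sum_k f[k] * W_dec[k] + b_dec; row k of W_dec is feature k's direction.
    h_hat = [b_dec[j] + sum(fk * W_dec[k][j] for k, fk in enumerate(f))
             for j in range(len(h))]
    return f, h_hat

def sae_loss(h, f, h_hat, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)
    return mse + l1_coeff * sum(abs(fk) for fk in f)

def steer(h, W_dec, feature_idx, alpha):
    """Hypothetical control step: shift h along one learned feature direction."""
    return [hi + alpha * d for hi, d in zip(h, W_dec[feature_idx])]

# Toy sizes for illustration only (hidden size 4, dictionary size 8).
rng = random.Random(0)
hidden, dict_size = 4, 8
W_enc = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dict_size)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dict_size)]
b_enc, b_dec = [0.0] * dict_size, [0.0] * hidden

h = [rng.gauss(0, 1) for _ in range(hidden)]
f, h_hat = sae_forward(h, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(h, f, h_hat, l1_coeff=0.1)
h_steered = steer(h, W_dec, feature_idx=3, alpha=2.0)
```

In this sketch, interpretation would amount to inspecting which entries of `f` fire for which inputs (for example, correlating them with item genres), and control would amount to passing `h_steered` instead of `h` to the remaining layers of the model. Training the weights by minimizing `sae_loss` over many hidden states is omitted.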

Paper Structure

This paper contains 27 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: The schema of SAE for sequential recommendations.
  • Figure 2: Correlation between genres and top feature (the feature with maximum correlation) for each genre. One row corresponds to one feature and contains its correlations with all genres.
  • Figure 3: Dependency of the SAE interpretability metric (mean correlation) on the SAE parameters. For the L1 plot, the dictionary size is set to 2048, and for the dictionary size plot, the L1 is set to 0.1.
  • Figure 4: Comparison between interpretability of SAE features and neurons of the original transformer layer. The maximum correlation for each genre is blue for the transformer layer and orange for SAE. For the Music4all dataset, the genres are sorted by their popularity in the dataset.
  • Figure 5: Distribution of the activation values for SAE features corresponding to a given genre. The orange color corresponds to cases when the genre is present; the blue color corresponds to cases when the genre is absent.
  • ...and 10 more figures