Interpret and Control Dense Retrieval with Sparse Latent Features

Hao Kang, Tevin Wang, Chenyan Xiong

TL;DR

Dense retrieval embeddings perform strongly but lack interpretability and controllability. The authors propose a k-sparse autoencoder trained with a retrieval-oriented loss that combines reconstruction with a KL-divergence objective to preserve query–document relevance, producing sparse latent features that remain faithful for retrieval. Interpretability comes from Neuron-to-Graph (N2G) analysis, which reveals meaningful concepts in the latent space; controllability is demonstrated by amplifying latent features to steer retrieval toward specific perspectives. Empirically, both the sparse latents and their reconstructed embeddings retain nearly the original retrieval accuracy on MS MARCO and BEIR, and controlled experiments show that targeted feature amplification can bias results, offering a transparent mechanism for adjusting dense retrieval systems. The approach enables interpretable and controllable dense retrieval, and the code is publicly available.
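To make the training objective more concrete, here is a minimal NumPy sketch of a k-sparse autoencoder forward pass and a loss that adds a KL-divergence term to the reconstruction error. It is an illustration under stated assumptions, not the authors' implementation: the class and function names, the tied weight initialization, the `alpha` weight, the use of in-batch dot-product scores as the relevance distribution, and the choice to apply the KL term to both the reconstructed embeddings and the sparse latents are all hypothetical.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest entries of each row and zero out the rest."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return out

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # subtract row max for stability
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def kl_divergence(p_logits, q_logits):
    """Mean KL(P || Q) over rows, with P and Q the row-wise softmaxes."""
    log_p, log_q = log_softmax(p_logits), log_softmax(q_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean())

class KSparseAutoencoder:
    """Hypothetical k-sparse autoencoder (forward pass only, no training)."""

    def __init__(self, dim, n_latents, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.W_enc = rng.normal(scale=dim ** -0.5, size=(dim, n_latents))
        self.W_dec = self.W_enc.T.copy()  # assumption: decoder starts tied
        self.b = np.zeros(dim)

    def encode(self, x):
        return topk_activation((x - self.b) @ self.W_enc, self.k)

    def decode(self, z):
        return z @ self.W_dec + self.b

def retrieval_oriented_loss(sae, queries, docs, alpha=1.0):
    """Reconstruction error plus KL terms on query-document score distributions."""
    zq, zd = sae.encode(queries), sae.encode(docs)
    rq, rd = sae.decode(zq), sae.decode(zd)
    recon = np.mean((rq - queries) ** 2) + np.mean((rd - docs) ** 2)
    target = queries @ docs.T  # scores from the original dense embeddings
    kld = kl_divergence(target, rq @ rd.T) + kl_divergence(target, zq @ zd.T)
    return float(recon + alpha * kld)

rng = np.random.default_rng(1)
queries, docs = rng.normal(size=(4, 16)), rng.normal(size=(6, 16))
sae = KSparseAutoencoder(dim=16, n_latents=64, k=8)
print(retrieval_oriented_loss(sae, queries, docs))
```

The KL terms are what tie the sparse space to retrieval: a reconstruction that is close in mean squared error can still reorder documents, whereas matching the score distribution penalizes exactly that kind of change.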

Abstract

Dense embeddings deliver strong retrieval performance but often lack interpretability and controllability. This paper introduces a novel approach using sparse autoencoders (SAEs) to interpret and control dense embeddings via the learned latent sparse features. Our key contribution is the development of a retrieval-oriented contrastive loss, which ensures the sparse latent features remain effective for retrieval tasks and thus meaningful to interpret. Experimental results demonstrate that both the learned latent sparse features and their reconstructed embeddings retain nearly the same retrieval accuracy as the original dense vectors, affirming their faithfulness. Further examination of the sparse latent space reveals interesting features underlying the dense embeddings, and we can control retrieval behavior by manipulating the latent sparse features, for example by prioritizing documents from specific perspectives in the retrieval results.
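One simple way to picture the faithfulness claim is to compare the rankings produced by the original embeddings with those produced by a substitute representation. The sketch below is a label-free proxy and not the paper's evaluation, which reports retrieval accuracy on labeled benchmarks. The function names are hypothetical, and the "reconstructed" documents are simulated with small random noise rather than produced by a trained autoencoder.

```python
import numpy as np

def top_n(scores, n):
    """Indices of the n highest-scoring documents per query, best first."""
    return np.argsort(-scores, axis=1)[:, :n]

def mean_overlap_at_n(ref_scores, new_scores, n):
    """Average fraction of the reference top-n that is also in the new top-n."""
    ref, new = top_n(ref_scores, n), top_n(new_scores, n)
    return float(np.mean([len(set(r) & set(s)) / n for r, s in zip(ref, new)]))

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 32))
docs = rng.normal(size=(100, 32))
# Stand-in for SAE reconstructions: the original vectors plus small noise.
recon_docs = docs + 0.05 * rng.normal(size=docs.shape)

overlap = mean_overlap_at_n(queries @ docs.T, queries @ recon_docs.T, n=10)
print(f"top-10 overlap: {overlap:.2f}")
```

An overlap near 1 means the substitute representation returns almost the same documents as the original, which is the intuition behind calling the latents and reconstructions faithful.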

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: An overview of our framework. We first train the $k$-sparse autoencoder with our retrieval-oriented contrastive loss, which produces sparse latent features that are effective for retrieval. Next, we interpret these latents using the Neuron-to-Graph (N2G) approach and demonstrate controllability via retrieval on the manipulated embeddings.
  • Figure 2: Retrieval performance of reconstructed (Rec.) embeddings and the sparse latent features (Spr.) before and after the contrastive loss KLD is applied on MS MARCO using BGE as the embedding model. Results on BEIR can be found in the ablation-study appendix.
  • Figure 3: Frequency distribution comparison between bag-of-words and sparse latent features in MS MARCO using BGE as the embedding model. The high-frequency region is characterized by a small number of words that occur with extreme regularity, whereas the low-frequency region consists of a large proportion of words that appear only a limited number of times throughout the dataset.
  • Figure 4: Improvement in retrieval scores on manipulated documents and queries by amplifying relevant sparse latent features across varying amounts using BGE as the embedding model. The x-axis uses a logarithmic scale to make the trends easier to see, since the amount doubles at each step.
  • Figure 5: Retrieval performance of reconstructed (Rec.) embeddings and the sparse latent features (Spr.) before and after the contrastive loss KLD is applied on BEIR using BGE as the embedding model.
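
The amplification sweep described in the Figure 4 caption can be sketched as follows. This is a toy illustration under stated assumptions rather than the paper's procedure: the decoder weights are random, the feature index and the query and document latents are made up, amplification is modeled as adding a fixed amount to one query latent, and the score is a plain dot product between decoded embeddings.

```python
import numpy as np

def amplify(latents, feature, amount):
    """Return a copy of the sparse latents with `amount` added to one feature."""
    out = latents.copy()
    out[..., feature] += amount
    return out

def score_gain(query_z, doc_z, W_dec, feature, amount):
    """Change in dot-product score after amplifying one query latent."""
    doc_emb = doc_z @ W_dec
    base = (query_z @ W_dec) @ doc_emb
    steered = (amplify(query_z, feature, amount) @ W_dec) @ doc_emb
    return float(steered - base)

rng = np.random.default_rng(0)
n_latents, dim, feature = 32, 8, 3
W_dec = rng.normal(size=(n_latents, dim))  # hypothetical decoder weights

query_z = np.zeros(n_latents)
query_z[[1, 7]] = 1.0                      # query does not use the feature
doc_z = np.zeros(n_latents)
doc_z[feature] = 2.0                       # document is dominated by it

amounts = [2.0 ** i for i in range(6)]     # doubling steps: 1, 2, 4, ...
gains = [score_gain(query_z, doc_z, W_dec, feature, a) for a in amounts]
for a, g in zip(amounts, gains):
    print(f"amount={a:5.1f}  score gain={g:8.2f}")
```

Because decoding is linear in this sketch, the gain grows in proportion to the amplification amount, so doubling steps appear evenly spaced on a logarithmic x-axis. A trained model evaluated on real rankings need not behave this simply.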