Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation

Tiansheng Wen; Yifei Wang; Zequn Zeng; Zhong Peng; Yudi Su; Xinyang Liu; Bo Chen; Hongwei Liu; Stefanie Jegelka; Chenyu You

TL;DR

The paper addresses the need for adaptive, high-fidelity representations that balance accuracy and retrieval efficiency. It introduces Contrastive Sparse Representation (CSR), a sparse-coding, post-training framework built on frozen pre-trained embeddings and optimized with reconstruction and non-negative contrastive losses. CSR consistently outperforms Matryoshka Representation Learning (MRL) across vision, text, and multimodal benchmarks, delivering near-full-representation performance at substantially reduced training and inference costs. The work demonstrates CSR’s practical potential for large-scale retrieval systems, though it notes ongoing challenges with dead latents in some alignment spaces and suggests future refinements.

Abstract

Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep
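
To make the idea of sparsifying a frozen embedding into a "high-dimensional but selectively activated feature space" concrete, here is a minimal NumPy sketch. It is not taken from the paper's code: it assumes a top-k activation rule as one possible way to control sparsity, uses random untrained weights, and shows only a reconstruction term (the contrastive objective is omitted). All names (`topk_sparse_encode`, `reconstruction_loss`, `W_enc`, `W_dec`, `k`) are hypothetical.

```python
import numpy as np

def topk_sparse_encode(x, W_enc, b_enc, k):
    """Map dense embeddings to a wider space, keeping at most k activations per row.

    Assumption: sparsity is enforced with a top-k rule; the abstract does not
    specify the exact mechanism.
    """
    pre = np.maximum(x @ W_enc + b_enc, 0.0)          # ReLU keeps codes non-negative
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]    # indices of the k largest entries per row
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return z

def reconstruction_loss(x, z, W_dec, b_dec):
    """Mean squared reconstruction error of the dense embedding from its sparse code."""
    x_hat = z @ W_dec + b_dec
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

# Toy stand-ins for frozen pre-trained embeddings: 8 vectors of dimension 16.
rng = np.random.default_rng(0)
d, h, k = 16, 64, 4                                   # dense dim, sparse dim, active latents
x = rng.normal(size=(8, d))

# Untrained illustrative weights; a real system would learn these.
W_enc = rng.normal(scale=0.1, size=(d, h))
b_enc = np.zeros(h)
W_dec = W_enc.T.copy()
b_dec = np.zeros(d)

z = topk_sparse_encode(x, W_enc, b_enc, k)
loss = reconstruction_loss(x, z, W_dec, b_dec)
```

In a sketch like this, changing `k` at inference time is what trades accuracy against retrieval cost: the backbone embedding `x` stays fixed, and only the number of active entries in `z` varies.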
