Compressed Concatenation of Small Embedding Models

Mohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar, Jelena Mitrovic, Michael Granitzer

TL;DR

The paper tackles the deployment bottleneck of embedding models in resource-constrained settings by concatenating the embeddings of multiple small, frozen models to broaden semantic coverage. A lightweight decoder trained with a Matryoshka Representation Learning (MRL) objective maps the high-dimensional concatenation to a compact space while preserving pairwise cosine similarities, enabling efficient retrieval without fine-tuning the base models. Key contributions include: (1) demonstrating that multi-model concatenation can outperform a larger single-model baseline, (2) introducing an MRL-trained decoder that compresses the joint representation while preserving most of the original performance, and (3) showing that robustness under compression and quantization improves as more base models are concatenated, even though the retrieval gains diminish. On a subset of MTEB retrieval tasks, the pipeline recovers $89\%$ of the original retrieval performance at a $48\times$ compression factor when applied to a concatenation of four small embedding models, offering an efficient alternative to scaling up model size.

Abstract

Embedding models are central to dense retrieval, semantic search, and recommendation systems, but their size often makes them impractical to deploy in resource-constrained environments such as browsers or edge devices. While smaller embedding models offer practical advantages, they typically underperform compared to their larger counterparts. To bridge this gap, we demonstrate that concatenating the raw embedding vectors of multiple small models can outperform a single larger baseline on standard retrieval benchmarks. To overcome the resulting high dimensionality of naive concatenation, we introduce a lightweight unified decoder trained with a Matryoshka Representation Learning (MRL) loss. This decoder maps the high-dimensional joint representation to a low-dimensional space, preserving most of the original performance without fine-tuning the base models. We also show that while concatenating more base models yields diminishing gains, the robustness of the decoder's representation under compression and quantization improves. Our experiments show that, on a subset of MTEB retrieval tasks, our concat-encode-quantize pipeline recovers 89% of the original performance with a 48x compression factor when the pipeline is applied to a concatenation of four small embedding models.
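
The concat-encode-quantize pipeline described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the four base dimensions (384 each), the decoder output size (256), the Matryoshka prefix length (128), the int8 quantizer, and all function names. The decoder is an untrained random linear map standing in for the MRL-trained decoder, so the sketch shows only data flow and storage arithmetic, not retrieval quality. The dimensions are chosen so that the arithmetic happens to give a 48x factor; the abstract does not state which dimensions or bit widths produce the reported figure.

```python
import math
import random


def concat_embeddings(per_model_vectors):
    """Concatenate the raw embedding vectors, one per base model."""
    joint = []
    for vec in per_model_vectors:
        joint.extend(vec)
    return joint


def make_linear_decoder(in_dim, out_dim, seed=0):
    """Untrained random linear map (hypothetical stand-in for the trained decoder)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def decode(weights, joint):
    """Project the joint vector to the decoder's output space."""
    return [sum(w * x for w, x in zip(row, joint)) for row in weights]


def truncate_prefix(vec, k):
    """Matryoshka-style truncation: keep the first k coordinates, then L2-normalize."""
    prefix = vec[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def quantize_int8(vec):
    """Symmetric int8 quantization: scale by the largest magnitude, round to [-127, 127]."""
    max_abs = max(abs(x) for x in vec) or 1.0
    return [round(127 * x / max_abs) for x in vec]


def compression_factor(in_dim, in_bits, out_dim, out_bits):
    """Ratio of storage per vector before and after compression."""
    return (in_dim * in_bits) / (out_dim * out_bits)


# Hypothetical setup: four small models with 384-dimensional float32 outputs.
rng = random.Random(42)
base_dims = [384, 384, 384, 384]
vectors = [[rng.gauss(0.0, 1.0) for _ in range(d)] for d in base_dims]

joint = concat_embeddings(vectors)                       # 1536 float32 values
weights = make_linear_decoder(len(joint), 256)           # assumed decoder width
compact = truncate_prefix(decode(weights, joint), 128)   # assumed MRL prefix
codes = quantize_int8(compact)                           # 128 int8 values

factor = compression_factor(len(joint), 32, len(codes), 8)
print(len(joint), len(codes), factor)
```

Under these assumed sizes, each document is stored as 128 bytes instead of 6144 bytes. Because an MRL-trained decoder is meant to keep every prefix of its output useful, the prefix length can be chosen at deployment time without retraining.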

Paper Structure

This paper contains 22 sections, 4 equations, 4 tables.