2D Matryoshka Sentence Embeddings

Xianming Li; Zongxi Li; Jing Li; Haoran Xie; Qing Li

TL;DR

This work tackles the rigidity of fixed-depth, fixed-size sentence embeddings by introducing Two-dimensional Matryoshka Sentence Embeddings (2DMSE). The method randomizes both Transformer depth and embedding size during training, learning nested representations via matryoshka-style losses and KL-divergence alignment between shallow and last layers. Key contributions include an elastic framework for depth and width, a joint objective combining full- and shallow-layer embeddings, and empirical demonstrations of strong STS performance and notable efficiency gains, including scalability to smaller models. The approach enables deployable, resource-aware sentence embeddings without substantial losses in accuracy, making it well-suited for diverse downstream tasks and budgets.

Abstract

Common approaches rely on fixed-length embedding vectors from language models as sentence embeddings for downstream tasks such as semantic textual similarity (STS). Such methods are limited in their flexibility due to unknown computational constraints and budgets across various applications. Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) encodes information at finer granularities, i.e., with lower embedding dimensions, to adaptively accommodate ad hoc tasks. Similar accuracy can be achieved with a smaller embedding size, leading to speedups in downstream tasks. Despite its improved efficiency, MRL still requires traversing all Transformer layers before obtaining the embedding, which remains the dominant factor in time and memory consumption. This prompts consideration of whether the fixed number of Transformer layers affects representation quality and whether using intermediate layers for sentence representation is feasible. In this paper, we introduce a novel sentence embedding model called Two-dimensional Matryoshka Sentence Embedding (2DMSE). It supports elastic settings for both embedding sizes and Transformer layers, offering greater flexibility and efficiency than MRL. We conduct extensive experiments on STS tasks and downstream applications. The experimental results demonstrate the effectiveness of our proposed model in dynamically supporting different embedding sizes and Transformer layers, allowing it to be highly adaptable to various scenarios. Our code is available at https://github.com/SeanLee97/AnglE/blob/main/README_2DMSE.md.
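The idea of using a smaller embedding size at inference time can be illustrated with a minimal sketch. It assumes that a matryoshka-style embedding is shortened by keeping its leading dimensions and that similarity is scored with cosine similarity. The vectors are made-up toy values, and `truncate` and `cosine` are hypothetical helper names, not functions from the authors' code.

```python
import math


def truncate(embedding, size):
    """Keep the first `size` dimensions of an embedding (assumed nesting scheme)."""
    if not 0 < size <= len(embedding):
        raise ValueError("size must be between 1 and len(embedding)")
    return embedding[:size]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 8-dimensional "sentence embeddings" (made-up numbers, not model output).
emb_a = [0.9, 0.1, 0.4, -0.2, 0.05, 0.3, -0.1, 0.2]
emb_b = [0.8, 0.2, 0.5, -0.1, -0.3, 0.1, 0.4, -0.2]

# Score the same sentence pair at several nested embedding sizes.
scores = {
    size: cosine(truncate(emb_a, size), truncate(emb_b, size))
    for size in (2, 4, 8)
}
```

Shorter vectors make each comparison cheaper, which is the width side of the trade-off. Reducing depth additionally requires stopping the encoder at an intermediate layer, which this sketch does not model.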

Paper Structure (22 sections, 7 equations, 4 figures, 3 tables)

Introduction
Related Work
2D Matryoshka Sentence Embeddings Framework
Encoder
Scalable Sentence Embedding Learning
Sentence Embedding Alignment
Joint Learning
Experimental Setup
Datasets.
Evaluation Metrics.
Baselines.
Implementation Details.
Experimental Results
Main Results
Ablation Study
...and 7 more sections

Figures (4)

Figure 1: A visual comparison of various sentence embedding methods. The gray blocks represent Transformer layers fine-tuned with AnglE, which are not optimized for matryoshka representation. The purple block represents Transformer layers fine-tuned with AnglE together with matryoshka loss.
Figure 2: The overall framework of 2DMSE. The left box represents the 2DMSE training stage, which involves two random processes: sampling a Transformer layer and sampling a hidden size. The selected layer and the last layer (pink rectangle) are then chosen for sentence embedding learning without scaling the hidden size. The selection of the hidden size (purple dashed rectangle) is also considered for sentence embedding learning. KL divergence is optimized during training to align the shallow layers with the last layer. The right box illustrates the inference stage, where all Transformer layers are scalable and can produce high-quality sentence embeddings for downstream applications after 2DMSE training.
Figure 3: Results of the STS benchmark with a cascade of hidden sizes: $8 \rightarrow 16 \rightarrow 32 \rightarrow 64 \rightarrow 128 \rightarrow 256 \rightarrow 512 \rightarrow 768$ from BERT$_{base}$. The score represents the average Spearman's correlation. BERT$_{base}$ serves as the backbone for all models. The blue $\bullet$ indicates the results of sentence embeddings from AnglE without any scalable sentence embedding learning. The red $\blacklozenge$ represents the results of matryoshka sentence embeddings. The green $\blacksquare$ denotes the results of our proposed 2D Matryoshka Sentence Embeddings (2DMSE). The layer index $=i$ denotes the $i$-th attention layer.
Figure 4: Subfigure (a) shows the time taken to encode the entire STS benchmark using embeddings from different layers. Subfigure (b) shows the average Spearman's correlation scores of different layers. Both (a) and (b) use an embedding size of $768$ and the standard STS benchmark dataset.
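The two random draws described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The candidate size list is borrowed from the Figure 3 cascade, the set of (layer, size) pairs that receive a loss is one reading of the caption, and the KL term is computed on a softmax over toy vector values purely for demonstration. `plan_training_step`, `softmax`, and `kl_divergence` are hypothetical helpers.

```python
import math
import random

# Assumed candidate sizes, taken from the cascade listed in the Figure 3 caption.
CANDIDATE_SIZES = [8, 16, 32, 64, 128, 256, 512]


def plan_training_step(num_layers, full_size, rng):
    """Draw one shallow layer and one reduced size, then list the
    (layer, size) pairs that would each receive an embedding loss."""
    shallow = rng.randint(1, num_layers - 1)  # any layer except the last
    small = rng.choice([s for s in CANDIDATE_SIZES if s < full_size])
    return {
        "shallow_layer": shallow,
        "small_size": small,
        "loss_targets": [
            (num_layers, full_size),  # last layer, full width
            (shallow, full_size),     # sampled layer, full width
            (num_layers, small),      # last layer, truncated (assumption)
            (shallow, small),         # sampled layer, truncated (assumption)
        ],
        "kl_pair": (shallow, num_layers),  # shallow layer aligned to last layer
    }


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


rng = random.Random(0)
plan = plan_training_step(num_layers=12, full_size=768, rng=rng)

# Toy alignment term on made-up 4-dimensional vectors (not model output).
last_layer_vec = [0.9, 0.1, 0.4, -0.2]
shallow_vec = [0.7, 0.3, 0.2, 0.0]
alignment = kl_divergence(softmax(last_layer_vec), softmax(shallow_vec))
```

In a real training loop each entry in `loss_targets` would be scored with the actual sentence embedding loss, and the alignment term would be added to form the joint objective. Both are left out here because this page does not specify them.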
