Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width

Zheng Liu, Chaofan Li, Shitao Xiao, Chaozhuo Li, Defu Lian, Yingxia Shao

TL;DR

This paper tackles the challenge of achieving high-precision text re-ranking with large language models under practical compute constraints. It introduces Matryoshka Re-Ranker, a runtime-configurable framework that supports customization of the depth $n$ and the per-layer width $L_i$, augmented by cascaded self-distillation and a factorized compensation mechanism to mitigate accuracy loss across arbitrary sub-structures. Empirical results on MSMARCO and BEIR show that Matryoshka Re-Ranker achieves state-of-the-art or competitive re-ranking performance in both full-scale and lightweight configurations, with substantial FLOP reductions and speedups at minimal accuracy loss. The work offers a practical, adaptable approach for real-world retrieval systems and provides a foundation for distillation and post-training compensation in flexible, large-model re-rankers.

Abstract

Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in practice due to constraints on computation bandwidth. In this work, we propose a \textbf{flexible} architecture called \textbf{Matryoshka Re-Ranker}, which is designed to facilitate \textbf{runtime customization} of model layers and sequence lengths at each layer based on users' configurations. Consequently, LLM-based re-rankers can be made applicable across various real-world situations. The increased flexibility may come at the cost of precision loss. To address this problem, we introduce a suite of techniques to optimize the performance. First, we propose \textbf{cascaded self-distillation}, where each sub-architecture learns to preserve a precise re-ranking performance from its super components, whose predictions can be exploited as smooth and informative teacher signals. Second, we design a \textbf{factorized compensation mechanism}, where two collaborative Low-Rank Adaptation modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulting from arbitrary combinations of layer and sequence compression. We perform comprehensive experiments based on the passage and document retrieval datasets from MSMARCO, along with all public datasets from the BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms the existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios.
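
The distillation idea in the abstract, where a sub-architecture learns from the predictions of a larger one, can be illustrated with a minimal sketch. The abstract does not state the loss, so the KL divergence between softmax-normalized candidate scores used here is an assumption, and the names `softmax`, `distillation_loss`, and `temperature` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn raw re-ranking scores for one query's candidates into a distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_scores: Sequence[float],
    student_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over one candidate list (an assumed loss choice).

    The teacher stands for a super-structure's predictions and the student
    for a sub-structure's predictions on the same candidates.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy scores for three candidate passages of a single query.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.5]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two rankings diverge, which is what makes the teacher's soft scores a smoother signal than binary relevance labels.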

Paper Structure

This paper contains 24 sections, 1 theorem, 10 equations, 4 figures, 4 tables.

Key Result

Theorem 3.1

A sub-structure $\mathcal{N}$ of Matryoshka re-ranker is dominated by its super-architecture $\mathcal{N}'$ in re-ranking precision: $\sigma_{\mathcal{N}'} \succ \sigma_{\mathcal{N}}$.
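
To make the sub-structure relation in the theorem concrete, here is a small sketch under stated assumptions. This page does not define when $\mathcal{N}$ counts as a sub-structure of $\mathcal{N}'$, so the ordering below (no more layers, and no more tokens at any shared layer) is an assumed reading, and `SubStructure` and `is_sub_structure` are hypothetical names. The sketch only checks the structural relation; it says nothing about precision itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubStructure:
    # Hypothetical container: widths[i] is the sequence length kept at layer i,
    # so len(widths) plays the role of the depth n.
    widths: Tuple[int, ...]

    @property
    def depth(self) -> int:
        return len(self.widths)


def is_sub_structure(small: SubStructure, large: SubStructure) -> bool:
    # Assumed ordering: `small` uses no more layers than `large`
    # and keeps no more tokens at any layer the two share.
    if small.depth > large.depth:
        return False
    return all(s <= l for s, l in zip(small.widths, large.widths))


full = SubStructure(widths=(512, 512, 512, 512))
light = SubStructure(widths=(512, 256))
print(is_sub_structure(light, full))  # True
print(is_sub_structure(full, light))  # False
```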

Figures (4)

  • Figure 1: Matryoshka re-ranker (A) can be directly customized into arbitrary shapes based on users' configurations. In contrast, the conventional method (B) needs to perform ad-hoc pruning of the full-scale model and fine-tune it for each specific scenario.
  • Figure 2: Cascaded Self-Distillation. Upper: full-width layerwise predictions are used as the teacher committee. Lower: students make selective use of teachers to distill knowledge.
  • Figure 3: Factorized compensation mechanism. The vertical (V-LoRA) and horizontal (H-LoRA) compensation modules are selected and added up to make up the precision loss.
  • Figure 4: Re-ranking performance (MRR@10) vs. FLOPs / inference time saving based on different forms of compression.
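
The "selected and added up" step in the Figure 3 caption can be sketched as follows. This is a toy illustration on small list-based matrices, not the paper's implementation: the standard LoRA form (a low-rank product added to a frozen base weight) is assumed, and the banks keyed by depth and width settings, along with every name below, are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
LoRA = Tuple[Matrix, Matrix]  # (B, A): the update is the low-rank product B @ A


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def compensated_weight(base: Matrix, v_lora: LoRA, h_lora: LoRA) -> Matrix:
    """Return base + B_v @ A_v + B_h @ A_h (both updates added to the base)."""
    b_v, a_v = v_lora
    b_h, a_h = h_lora
    return add(add(base, matmul(b_v, a_v)), matmul(b_h, a_h))


# Hypothetical banks: one vertical module per depth setting,
# one horizontal module per width setting.
v_bank: Dict[int, LoRA] = {8: ([[1.0], [0.0]], [[0.5, 0.0]])}
h_bank: Dict[int, LoRA] = {256: ([[0.0], [1.0]], [[0.0, 0.25]])}

base = [[1.0, 0.0], [0.0, 1.0]]
# Select the modules matching the requested depth and width, then add them up.
weight = compensated_weight(base, v_bank[8], h_bank[256])
print(weight)  # [[1.5, 0.0], [0.0, 1.25]]
```

Keeping the two modules separate means the number of stored adapters grows with the number of depth settings plus the number of width settings, rather than with every depth and width combination.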

Theorems & Definitions (1)

  • Theorem 3.1