D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

Zihan Liao, Hang Yu, Jianguo Li, Jun Wang, Wei Zhang

TL;DR

D2LLM tackles semantic search by decomposing a cross-encoder LLM into a fast bi-encoder with Pooling by Multihead Attention (PMA) and an Interaction Emulation Module (IEM), enabling pre-computed passage embeddings while preserving nuanced query-passage interactions. It then distills the teacher's knowledge through three losses—Contrastive Imitation, Rank Imitation, and Feature Imitation—to train a compact student that approaches cross-encoder performance. Empirically, D2LLM outperforms five strong baselines across NLI, STS, and IR, with notable gains such as $6.45\%$ on NLI over BGE and up to $14.39\%$ over LLaRA, while maintaining near bi-encoder efficiency. The approach demonstrates a practical path to high-accuracy, scalable semantic search suitable for real-time deployment.
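The decomposition described above can be pictured with a small, self-contained sketch. It is not the authors' implementation: the single-head `attention_pool` is a simplified stand-in for PMA, the two-layer `interaction_score` MLP is a stand-in for the IEM, and every name, dimension, and random weight below is a hypothetical chosen for illustration. The point it shows is only the workflow: passage embeddings are pooled once offline, and at query time a lightweight module scores each stored embedding against the query embedding.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(token_vecs, pool_query):
    """Single-head attention pooling (simplified stand-in for PMA):
    one learnable query vector attends over the token vectors."""
    scale = math.sqrt(len(pool_query))
    weights = softmax([dot(pool_query, t) / scale for t in token_vecs])
    dim = len(token_vecs[0])
    return [sum(w * t[i] for w, t in zip(weights, token_vecs)) for i in range(dim)]


def interaction_score(q_emb, p_emb, w_hidden, w_out):
    """Tiny ReLU MLP over the concatenated embeddings (stand-in for the IEM)."""
    x = q_emb + p_emb  # list concatenation, length 2 * DIM
    hidden = [max(0.0, dot(row, x)) for row in w_hidden]
    return dot(w_out, hidden)


rng = random.Random(0)
DIM, HIDDEN = 4, 8  # hypothetical sizes


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical "learned" parameters, random here for illustration only.
pool_query = rand_vec(DIM)
w_hidden = [rand_vec(2 * DIM) for _ in range(HIDDEN)]
w_out = rand_vec(HIDDEN)

# Offline: pool each passage's token vectors once and store the result.
passages = {
    "p1": [rand_vec(DIM) for _ in range(5)],
    "p2": [rand_vec(DIM) for _ in range(3)],
}
passage_index = {pid: attention_pool(toks, pool_query) for pid, toks in passages.items()}

# Online: embed the query, then score it against every stored passage embedding.
query_tokens = [rand_vec(DIM) for _ in range(4)]
q_emb = attention_pool(query_tokens, pool_query)
scores = {
    pid: interaction_score(q_emb, p_emb, w_hidden, w_out)
    for pid, p_emb in passage_index.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because only `interaction_score` runs per query-passage pair, the per-query cost stays close to that of a plain bi-encoder, while the MLP can still model interactions that a fixed similarity such as cosine cannot.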

Abstract

The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLM (Decomposed and Distilled LLMs for semantic search), which combines the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at https://github.com/codefuse-ai/D2LLM.
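To make the idea of distilling with contrastive, rank, and feature imitation more concrete, here is a minimal sketch of how three such terms could be combined. The abstract does not give the loss formulas, so everything below is an assumption: the contrastive term is a plain softmax cross-entropy over student scores, the rank term is one minus the Pearson correlation between student and teacher scores, the feature term is a mean squared error between embeddings, and the weights `w_rank` and `w_feat` are hypothetical. The paper's actual formulations may differ.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def contrastive_term(student_scores, positive_idx):
    """Assumed stand-in: cross-entropy pushing the positive passage's score up."""
    return -log_softmax(student_scores)[positive_idx]


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def rank_term(student_scores, teacher_scores):
    """Assumed stand-in: small when the student orders passages like the teacher."""
    return 1.0 - pearson(student_scores, teacher_scores)


def feature_term(student_feat, teacher_feat):
    """Assumed stand-in: mean squared error between student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def distillation_loss(student_scores, teacher_scores, positive_idx,
                      student_feat, teacher_feat, w_rank=1.0, w_feat=1.0):
    """Weighted sum of the three illustrative terms (weights are hypothetical)."""
    return (contrastive_term(student_scores, positive_idx)
            + w_rank * rank_term(student_scores, teacher_scores)
            + w_feat * feature_term(student_feat, teacher_feat))


# Toy example: one query, three candidate passages, passage 0 is the positive.
teacher_scores = [4.0, 1.0, -2.0]
good_student = [3.5, 0.8, -1.9]   # same ordering as the teacher
bad_student = [-1.0, 0.5, 2.0]    # reversed ordering
teacher_feat = [0.2, -0.1, 0.4]

loss_good = distillation_loss(good_student, teacher_scores, 0, [0.2, -0.1, 0.4], teacher_feat)
loss_bad = distillation_loss(bad_student, teacher_scores, 0, [0.9, 0.9, 0.9], teacher_feat)
```

In this toy setup the student that reproduces the teacher's ordering and features receives the lower loss, which is the behavior any imitation-style objective is meant to reward.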

Paper Structure

This paper contains 36 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Architecture of D2LLM: The teacher model is decomposed into three segments corresponding to the input of the query, passage, and prompt, represented in light red, light blue, and light purple. Its output, represented by a dark purple square, is the classification token embedding, which, after a linear layer, yields logits. The student model retains the query and passage components from the teacher but adds the PMA and IEM to capture the interplay between the query and passage, as well as their combined interaction with the prompt. It also outputs classification token embeddings for symmetric and asymmetric search, respectively, and then derives logits via a linear layer.
  • Figure 2: Runtime Analysis.
  • Figure 3: Effect of the hyperparameters $\alpha$, $\beta$, and $\gamma$ on OCNLI and CMNLI.
  • Figure 4: Effect of the hyperparameters $\alpha$, $\beta$, and $\gamma$ on T2Retrieval.