Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking

Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun

TL;DR

Debater tackles the limitation of single-embedding document representations in dense retrieval by introducing deliberate step-by-step thinking. It combines Chain-of-Deliberation (CoD) and Self Distillation (SD) to generate and fuse multiple intermediate document embeddings, enabling richer, multi-view representations learned through contrastive training. On BEIR, Debater delivers significant gains over strong baselines and enables smaller LLMs to approach or match larger models, demonstrating robustness across model scales. The approach is evaluated with comprehensive ablations and case studies, and code and data are publicly available for replication.

Abstract

Recent dense retrievers increasingly leverage the robust text understanding capabilities of Large Language Models (LLMs), encoding queries and documents into a shared embedding space for effective retrieval. However, most existing methods represent each document with a single embedding, which is less effective at capturing its multifaceted semantics and thereby limits matching accuracy. In this paper, we propose Deliberate Thinking based Dense Retriever (Debater), a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater introduces a Chain-of-Deliberation mechanism, which iteratively refines document embeddings through a continuous chain-of-thought. To integrate information from various thinking steps, Debater further employs a Self Distillation mechanism that identifies and fuses the most informative steps into a unified embedding. Experimental results show that Debater significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All code and datasets are available at https://github.com/OpenBMB/DEBATER.
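
The abstract describes two ideas: producing several intermediate document embeddings and fusing them into one, then training with a contrastive objective. The toy NumPy sketch below illustrates only that general shape and is not the authors' implementation (see the linked repository for that). The function names `fuse_steps` and `info_nce`, the choice of the last step as the anchor for weighting, and both temperature values are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_steps(step_embs, temperature=1.0):
    """Fuse m step embeddings of shape (m, d) into one unit vector.

    Assumption: the last step acts as the anchor, and each step is
    weighted by a softmax over its cosine similarity to that anchor.
    """
    steps = l2_normalize(step_embs)
    anchor = steps[-1]
    scores = steps @ anchor / temperature
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    fused = weights @ steps
    return l2_normalize(fused), weights


def info_nce(query, docs, positive_idx, temperature=0.05):
    """Standard InfoNCE loss for one query against a set of documents."""
    q = l2_normalize(query)
    d = l2_normalize(docs)
    logits = d @ q / temperature
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_idx]


# Toy usage with random vectors standing in for LLM hidden states.
rng = np.random.default_rng(0)
step_embs = rng.normal(size=(4, 8))  # 4 thinking steps, dimension 8
fused, weights = fuse_steps(step_embs)
docs = np.stack([fused, *rng.normal(size=(2, 8))])
loss = info_nce(query=fused, docs=docs, positive_idx=0)
```

In this sketch the fused vector is what would be indexed for retrieval, so query-time cost stays at one embedding per document regardless of how many thinking steps were used.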

Paper Structure

This paper contains 15 sections, 13 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of our Deliberate Thinking based Dense Retriever (Debater). Debater leverages the reasoning capability of LLMs to produce fine-grained document representations for retrieval.
  • Figure 2: The architecture of the Deliberate Thinking based Dense Retriever (Debater).
  • Figure 3: Retrieval Performance of Debater with Different Thinking Depths. We set the length of the CoD to train different Debater models and evaluate them on different subsets of BEIR.
  • Figure 4: Performance of Debater at Different Thinking Steps. We collect all documents from each thinking step to demonstrate the retrieval performance of Debater across different stages of reasoning.
  • Figure 5: Similarity Relationship Between Adjacent Position Embeddings. Darker blue indicates a higher similarity score.
  • ...and 1 more figures