Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking

Yifan Ji; Zhipeng Xu; Zhenghao Liu; Yukun Yan; Shi Yu; Yishan Li; Zhiyuan Liu; Yu Gu; Ge Yu; Maosong Sun

TL;DR

Debater tackles the limitation of single-embedding document representations in dense retrieval by introducing deliberate step-by-step thinking. It combines Chain-of-Deliberation (CoD) and Self Distillation (SD) to generate and fuse multiple intermediate document embeddings, enabling richer, multi-view representations learned through contrastive training. On BEIR, Debater delivers significant gains over strong baselines and enables smaller LLMs to approach or match larger models, demonstrating robustness across model scales. The approach is evaluated with comprehensive ablations and case studies, and code and data are publicly available for replication.

Abstract

Recent dense retrievers increasingly leverage the robust text understanding capabilities of Large Language Models (LLMs), encoding queries and documents into a shared embedding space for effective retrieval. However, most existing methods represent each document with a single embedding, which is less effective at capturing its multifaceted semantics and thereby limits matching accuracy. In this paper, we propose Deliberate Thinking based Dense Retriever (Debater), a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater introduces a Chain-of-Deliberation mechanism, which iteratively refines document embeddings through a continuous chain-of-thought. To integrate information from various thinking steps, Debater further employs a Self Distillation mechanism that identifies and fuses the most informative steps into a unified embedding. Experimental results show that Debater significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All code and datasets are available at https://github.com/OpenBMB/DEBATER.
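
To make the idea of fusing several thinking-step embeddings into one document vector concrete, the toy sketch below may help. It is not the authors' implementation: the abstract does not specify the fusion rule, so the softmax weighting by query similarity, the `temperature` parameter, and the function names (`fuse_steps`, `normalize`, `softmax`) are all assumptions made for illustration, and the step embeddings are hand-written vectors rather than LLM outputs.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_steps(step_embeddings, query_embedding, temperature=0.1):
    """Hypothetical fusion: weight each step embedding by its cosine
    similarity to the query, then sum and re-normalize.
    This weighting scheme is an assumption, not the paper's stated rule."""
    q = normalize(query_embedding)
    steps = [normalize(e) for e in step_embeddings]
    scores = [dot(q, e) for e in steps]          # cosine similarity per step
    weights = softmax(scores, temperature)        # more informative steps get more weight
    fused = [sum(w * e[i] for w, e in zip(weights, steps)) for i in range(len(q))]
    return normalize(fused), weights

# Three made-up "thinking step" embeddings for one document, and one query.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
fused, weights = fuse_steps(steps, query)
```

In this toy setup the first step aligns best with the query, so it receives the largest weight and dominates the fused vector. In a real system the step embeddings would come from the LLM encoder at successive deliberation steps, and the fusion would be learned during contrastive training rather than computed from a known query.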
