O1 Embedder: Let Retrievers Think Before Action

Ruiran Yan, Zheng Liu, Defu Lian

TL;DR

This work introduces O1 Embedder, a retrieval model that performs slow-thinking before retrieval by generating and leveraging long-form thoughts to produce thought-augmented embeddings. It combines exploration-refinement data synthesis with a multi-task training regime that uses behavior cloning for thought generation and contrastive learning for dense retrieval, all under a memory-efficient joint training framework. Across 12 datasets spanning in-domain and BEIR out-of-domain benchmarks, O1 Embedder consistently outperforms strong baselines, with notable gains in complex and multi-hop settings, and demonstrates robustness across backbone models and varying scales. The approach offers a practical path toward next-generation IR foundation models by integrating reasoning-like thinking with dense retrieval, enabling more accurate and generalizable information access.

Abstract

The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, LLMs excel at fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning about complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before retrieving the target documents. To realize this objective, we address two technical challenges. First, we design a data synthesis workflow that creates training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and to perform dense retrieval through contrastive learning. We evaluate our approach in comprehensive experiments, achieving substantial improvements across 12 popular datasets that span both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.

Paper Structure

This paper contains 32 sections, 12 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: O1 Embedder. The model first generates thoughts about the question (thinking), then produces the embedding for dense retrieval (retrieval).
  • Figure 2: The production of thought data. In the first step, the LLM is prompted to generate candidate thoughts about the input question based on the instruction and in-context examples. In the second step, the retrieval committee evaluates the candidates by comparing them with the ground-truth document, i.e., the retrieval target. Finally, the candidate thought receiving the most votes is selected and incorporated into the training data.
  • Figure 3: Training and retrieval process of O1 Embedder. During training, O1 Embedder minimizes two losses: the generation loss while decoding the thought, and the contrastive loss while discriminating the target document. During retrieval, multiple thoughts are generated for the query. The thoughts are used to produce thought-augmented queries, which are independently encoded by O1 Embedder and aggregated for retrieval.
  • Figure 4: Top-20 attention scores from the <emb> token in the thought.
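
The committee vote in Figure 2 and the aggregation step in Figure 3 can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy lexical scorers standing in for committee retrievers, the one-vote-per-member rule, the tie-breaking rule, and the use of a normalized mean as the aggregation are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical committee member: maps (thought, document) to a similarity score.
Scorer = Callable[[str, str], float]


def jaccard(thought: str, doc: str) -> float:
    """Toy stand-in for a retriever: token-set Jaccard similarity."""
    a, b = set(thought.lower().split()), set(doc.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def coverage(thought: str, doc: str) -> float:
    """Toy stand-in for a retriever: fraction of document tokens in the thought."""
    a, b = set(thought.lower().split()), set(doc.lower().split())
    return len(a & b) / len(b) if b else 0.0


def select_thought(candidates: Sequence[str], target_doc: str,
                   committee: Sequence[Scorer]) -> str:
    """Return the candidate thought that receives the most committee votes.

    Assumption: each member casts one vote for the candidate it scores
    highest against the ground-truth document; ties go to the lowest index.
    """
    votes: Counter = Counter()
    for scorer in committee:
        best = max(range(len(candidates)),
                   key=lambda i: scorer(candidates[i], target_doc))
        votes[best] += 1
    winner = min(votes, key=lambda i: (-votes[i], i))
    return candidates[winner]


def aggregate_embeddings(embeddings: Sequence[Vector]) -> Vector:
    """Aggregate thought-augmented query embeddings.

    Assumption: aggregation is an element-wise mean followed by L2
    normalization; the source only says the embeddings are "aggregated".
    """
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


candidates = ["paris is the capital of france", "berlin is in germany"]
chosen = select_thought(candidates, "paris capital france", [jaccard, coverage])
query_vec = aggregate_embeddings([[1.0, 0.0], [0.0, 1.0]])
```

In a real pipeline the scorers would be dense retrievers and the vectors would come from encoding each thought-augmented query with the embedding model; the toy inputs here only exercise the control flow.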