Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer

Zhengbao Jiang; Luyu Gao; Jun Araki; Haibo Ding; Zhiruo Wang; Jamie Callan; Graham Neubig

TL;DR

This work reframes open-domain QA as a single end-to-end Transformer problem by treating retrieval as attention within a unified model, eliminating the need for separately trained retrievers and readers. By using the first $B$ encoder layers as a bi-encoder and the remaining layers as a cross-encoder, the approach computes retrieval scores from token-level attention and refines them through cross-document distillation against decoder-to-encoder attention. The method achieves competitive retrieval and QA performance on Natural Questions and demonstrates strong zero-shot and domain-adaptation capabilities on BEIR, highlighting end-to-end learning as a practical path for knowledge-intensive tasks. The results suggest that end-to-end adaptation, cross-document adjustment, and attention-based retrieval can yield robust, adaptable systems without retrieval-specific warm-up or annotations.
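To make the idea of computing retrieval scores from token-level attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, independently encoded query and document token vectors (as a bi-encoder would produce), and one particular aggregation (maximum over document tokens, then mean over query tokens) chosen purely for illustration. The function names `doc_relevance` and `rank_documents` are hypothetical.

```python
import numpy as np

def doc_relevance(q_tokens, d_tokens):
    """Aggregate token-level attention logits into one document score.

    q_tokens: (n_q, dim) query token vectors from the bi-encoder layers.
    d_tokens: (n_d, dim) document token vectors from the bi-encoder layers.
    The max-then-mean aggregation is an illustrative assumption.
    """
    # Scaled dot-product attention logits between every query/document token pair.
    logits = q_tokens @ d_tokens.T / np.sqrt(q_tokens.shape[-1])  # (n_q, n_d)
    # For each query token keep its best-matching document token, then average.
    return float(logits.max(axis=1).mean())

def rank_documents(q_tokens, docs):
    """Return document indices sorted by decreasing relevance, plus the scores."""
    rel = np.array([doc_relevance(q_tokens, d) for d in docs])
    return np.argsort(-rel), rel

# Toy example: document 0 aligns with the query tokens, document 1 opposes them.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b = np.array([[-1.0, 0.0], [0.0, -1.0]])
order, scores = rank_documents(query, [doc_a, doc_b])
```

Because each document is scored from token vectors computed independently of the query, document representations could in principle be precomputed and indexed; the cross-encoder layers are not modeled in this sketch.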

Abstract

Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers. Retrievers and readers are usually modeled separately, which necessitates a cumbersome implementation and is hard to train and adapt in an end-to-end fashion. In this paper, we revisit this design and eschew the separate architecture and training in favor of a single Transformer that performs Retrieval as Attention (ReAtt), and end-to-end training solely based on supervision from the end QA task. We demonstrate for the first time that a single model trained end-to-end can achieve both competitive retrieval and QA performance, matching or slightly outperforming state-of-the-art separately trained retrievers and readers. Moreover, end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings, making our model a simple and adaptable solution for knowledge-intensive tasks. Code and models are available at https://github.com/jzbjyb/ReAtt.

Paper Structure (37 sections, 7 equations, 2 figures, 8 tables)

Introduction
Retrieval as Attention (ReAtt)
Formal Definition
Leveraging Attention for Retrieval
Putting the Retriever into Transformers
From Token Attention to Document Relevance
End-to-end Retrieval with Attention
How Good is Attention As-is?
Learning Retrieval as Attention
Approximate Attention over the Corpus
Iterative Close Document Sub-sampling
In-batch Random Document Sub-sampling
Cross-document Adjustment with Decoder-to-Encoder Attention Distillation
Minimizing KL-divergence Between Retrieval and Target Attention
Zero Target Attention for Random Documents
...and 22 more sections

Figures (2)

Figure 1: Illustration of Retrieval as Attention (ReAtt) with the first $B = 2$ encoder layers as bi-encoder (i.e., retriever) and the remaining $L - B = 2$ layers as cross-encoder. During training, the retrieval attention between a query $\bm{q}_1$ and documents $\bm{d}_{11,12,13}$ is adjusted by minimizing its discrepancy from the target attention. For simplicity, we use a single arrow to represent attention of a single head between multiple tokens.
Figure 2: Illustration of approximate attention over the corpus with $|\mathcal{Q}| = 4$ queries in a batch and $K = 3$ close documents per query. We use $\bm{q}_1$ as an example to illustrate the required computation, where close documents require both retrieval and target attention while random documents only require retrieval attention.
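The training signal described in the two captions (retrieval attention pulled toward target attention, with random documents carrying only retrieval attention) can be sketched as a KL-divergence loss. This is a hedged illustration under stated assumptions rather than the paper's exact objective: the KL direction, the softmax normalization over documents, and the name `distillation_loss` are all assumptions, and random documents are simply assigned zero target mass as the outline's section titles suggest.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def distillation_loss(retrieval_scores, target_close, eps=1e-12):
    """KL(target || retrieval) over K close documents followed by random ones.

    retrieval_scores: (K + R,) unnormalized retrieval scores for one query.
    target_close: (K,) non-negative target attention for the close documents.
    Random documents (the last R entries) receive zero target attention.
    """
    k = target_close.shape[0]
    target = np.zeros(retrieval_scores.shape[0], dtype=float)
    target[:k] = target_close / target_close.sum()
    pred = softmax(retrieval_scores.astype(float))
    # Terms with zero target mass contribute nothing to the KL sum.
    mask = target > 0
    return float(np.sum(target[mask] * (np.log(target[mask]) - np.log(pred[mask] + eps))))

# K = 3 close documents and 2 random documents for a single query.
target_attention = np.array([0.5, 0.3, 0.2])
matched = np.array([np.log(0.5), np.log(0.3), np.log(0.2), -1e9, -1e9])
uniform = np.zeros(5)
loss_matched = distillation_loss(matched, target_attention)
loss_uniform = distillation_loss(uniform, target_attention)
```

In this toy setup the loss is near zero when the retrieval distribution already matches the target and places no mass on random documents, and it grows when random documents receive retrieval mass.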
