Conformer-Kernel with Query Term Independence for Document Retrieval

Bhaskar Mitra, Sebastian Hofstätter, Hamed Zamani, Nick Craswell

TL;DR

This work extends Transformer-Kernel to full document retrieval by introducing a memory-efficient Conformer layer, a Query Term Independence (QTI) encoding approach, and an explicit lexical term-matching component. The Conformer reduces self-attention memory from quadratic to linear in sequence length, enabling ranking over long documents, while QTI enables efficient offline precomputation and retrieval. An explicit term-matching branch (Duet-style) complements latent interactions to reduce false positives. Preliminary results on MS MARCO-based full retrieval show gains over traditional baselines and highlight the Conformer's memory efficiency. BERT-based methods still lead in some settings, and pretraining and sampling strategies are identified as key directions for future work.

Abstract

The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark and can be considered an efficient (but slightly less effective) alternative to BERT-based ranking models. In this work, we extend the TK architecture to the full retrieval setting by incorporating the query term independence assumption. Furthermore, to reduce the memory complexity of the Transformer layers with respect to the input sequence length, we propose a new Conformer layer. We show that the Conformer's GPU memory requirement scales linearly with input sequence length, making it a more viable option when ranking long documents. Finally, we demonstrate that incorporating an explicit term matching signal into the model can be particularly useful in the full retrieval setting. We present preliminary results from our work in this paper.
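
To make the query term independence assumption concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a document's score for a query is the sum of independent per-term scores, so those per-term scores can be computed offline and stored in an inverted index. The function names (`build_index`, `retrieve`) and the toy scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(term_doc_scores):
    """Group precomputed (term, doc_id) -> score pairs by term.

    In a real system the scores would come from the ranking model,
    evaluated offline for each vocabulary term against each document.
    Here they are supplied directly as toy values.
    """
    index = defaultdict(dict)
    for (term, doc_id), score in term_doc_scores.items():
        index[term][doc_id] = score
    return index


def retrieve(query_terms, index, k=10):
    """Score documents as the sum of their per-term scores.

    Under query term independence, each query term contributes to a
    document's score without looking at the other query terms, so
    query-time work reduces to index lookups and additions.
    """
    totals = defaultdict(float)
    for term in query_terms:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy per-term scores (illustrative values, not results from the paper).
toy_scores = {
    ("conformer", "d1"): 1.5,
    ("kernel", "d1"): 0.5,
    ("kernel", "d2"): 1.0,
}
toy_index = build_index(toy_scores)
ranking = retrieve(["conformer", "kernel"], toy_index)
```

Because no model inference happens inside `retrieve`, the expensive computation is confined to the offline indexing step, which is what makes retrieval over a full collection tractable under this assumption.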

Paper Structure

This paper contains 13 sections, 9 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: A comparison of the TK and the proposed CK-with-QTI architectures. In addition to replacing the Transformer layers with Conformers, the latter also simplifies the query encoding to non-contextualized term embedding lookup and incorporates a windowed Kernel-Pooling based aggregation that is employed independently per query term.
  • Figure 2: Comparison of peak GPU memory usage in MB, across all four GPUs, when employing Transformers vs. Conformers in our proposed architecture. For the Transformer-based model, we only plot up to a sequence length of 512, because for longer sequences we run out of GPU memory when using Tesla P100s with 16 GB of memory.
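
The quadratic versus linear contrast behind Figure 2 can be illustrated with a back-of-the-envelope count. This sketch only counts pairwise attention scores for standard self-attention and compares them with a hypothetical layer whose per-token cost is a fixed constant. The constant `per_token_cost` is an assumption made for illustration and does not describe the Conformer's actual internals, and the counts are not GPU memory measurements.

```python
def quadratic_entries(seq_len):
    """Number of pairwise attention scores for one head of standard
    self-attention: every token attends to every token."""
    return seq_len * seq_len


def linear_entries(seq_len, per_token_cost=64):
    """Entries for a hypothetical layer whose cost per token is a fixed
    constant (per_token_cost is an illustrative assumption)."""
    return seq_len * per_token_cost


# Doubling the sequence length quadruples the quadratic count
# but only doubles the linear one.
growth_quadratic = quadratic_entries(1024) / quadratic_entries(512)
growth_linear = linear_entries(1024) / linear_entries(512)
```

Real memory usage also depends on batch size, number of heads and layers, and stored activations, so the sketch shows only the growth rates, not absolute figures.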