Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts

Zhuo Chen, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Kewei Tu

TL;DR

This work tackles open-domain QA under long-context constraints by introducing a lightweight encoder that vectorizes additional retrieved contexts and interacts with a large language model via cross-attention. The approach extends the effective context length from baseline 2k tokens to up to 5k–10k tokens in dense form while keeping compute near the baseline. Empirical results across held-in, held-out, and ICL settings show consistent improvements over a strong 2k-baseline, with the frozen-encoder strategy offering the most stable gains. The method provides a simple, general pathway to leverage longer contexts in RAG-based ODQA without requiring large increases in computational resources, though it remains to be tested on larger LMs and in broader ICL scenarios.

Abstract

In the era of large language models, techniques such as Retrieval Augmented Generation can better address Open-Domain Question-Answering problems. Because of constraints such as model size and computing resources, the context length is often limited, and it becomes challenging to enable the model to cover overlong contexts while answering open-domain questions. This paper proposes a general and convenient method for covering longer contexts in Open-Domain Question-Answering tasks. It leverages a small encoder language model that encodes the contexts, and the resulting encodings interact with the original inputs through cross-attention. With our method, the original language model can cover contexts several times longer while keeping the computing requirements close to the baseline. Our experiments demonstrate that, after fine-tuning, performance improves across two held-in datasets, four held-out datasets, and two In Context Learning settings.
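
To make the cross-attention step concrete, here is a minimal pure-Python sketch of single-head scaled dot-product cross-attention, in which the task model's hidden states act as queries over the encoder's context vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, the absence of learned projection matrices, and the reuse of the context vectors as both keys and values are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context_vectors):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: hidden states of the task model, one vector per input token.
    context_vectors: encoder outputs for the additional retrieved contexts,
        used here as both keys and values (an assumption for brevity).
    Returns the attended outputs and the attention weights per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this query and every encoded context vector.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in context_vectors]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, context_vectors))
                 for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy data: 2 task-model tokens, 3 context vectors, dimension 4.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
context_vectors = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]
outputs, weights = cross_attention(queries, context_vectors)
```

Each query ends up with a weighted average of the context vectors, so the number of vectors the task model processes in its own input stays fixed while the extra contexts contribute only through the attention output. This is one plausible reading of how compute could stay close to the baseline.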

Paper Structure

This paper contains 26 sections, 13 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: A comparison of our method (lower) and retrieval augmented ODQA without vectorization (upper). In the upper part, limited retrieved contexts are processed by the task model to finish the task. The lower part illustrates our method in which an encoder is incorporated to encode overlong retrieved contexts.
  • Figure 2: Speed illustration. Run time is measured on a single A100 GPU with the batch size set to 1 for all curves. "2k" on the horizontal axis denotes the baseline model's run time when training or running inference on data of length 2k. "5k" and "10k" correspond to two variants of our method that cover at most 5k and 10k tokens during training and inference. Training time is the average over five consecutive training steps, and inference time is the average over five consecutive generation steps. Specifically, we measure the execution duration of the functions Trainer.training_step and model.generate as provided by Hugging Face (https://www.huggingface.co/).
  • Figure 3: Method illustration of the model architecture (purple blocks) and data flows (along black/purple arrows). The purple dashed arrows indicate that the output of the MLP module serves as the "query" for the Cross-attn module in the next layer. $\times N$ means that the modules with dotted backgrounds are repeated across multiple layers in the task model.
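
The timing protocol described in the Figure 2 caption (averaging the execution duration over five consecutive steps) can be sketched as follows. This is a rough illustration only: `average_duration` and `dummy_step` are hypothetical names introduced here, and `dummy_step` stands in for a real call such as Trainer.training_step or model.generate. The caption does not say how the authors read the clock or whether they synchronized the GPU before doing so, so those details are assumptions.

```python
import time


def average_duration(fn, n_steps=5):
    """Return the mean wall-clock time in seconds of n_steps consecutive calls to fn."""
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def dummy_step():
    # Hypothetical stand-in for one training or generation step.
    sum(i * i for i in range(10_000))


avg = average_duration(dummy_step)
print(f"average step time: {avg:.6f} s")
```

When timing real GPU work, kernels usually run asynchronously, so a synchronization call before each clock read is typically needed for the numbers to be meaningful.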