Context Filtering with Reward Modeling in Question Answering

Sangryul Kim, James Thorne

TL;DR

This work tackles the problem of irrelevant context degrading QA performance on long texts by introducing a reward-modeling–driven context-filtering approach. It leverages Direct Preference Optimization to train summarizers that selectively preserve information essential for answering questions, evaluated with a novel EM Per Token metric to quantify token efficiency. Across open-domain QA datasets, the method demonstrates substantial token-length reductions with only modest losses in EM/F1, achieving notable gains in efficiency, especially in low-resource or long-context settings. The findings suggest practical benefits for building token-efficient QA systems and point to future work integrating reward modeling with retrieval to further optimize information access and filtering across multiple contexts.
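The Direct Preference Optimization objective mentioned above can be sketched for a single preference pair. This is the standard published DPO loss, not code from the paper. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions. Here the "chosen" output would be the summary that keeps the answer-relevant content and the "rejected" output the one that does not.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full output sequence
    under either the trainable policy or the frozen reference model.
    `beta` is an assumed value; the paper's setting is not given here.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # output over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both outputs, the margin is zero and the loss equals log 2. It falls below that as the policy shifts probability toward the chosen summary.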

Abstract

Question Answering (QA) in NLP is the task of finding answers to a query within a relevant context retrieved by a retrieval system. Yet these contexts mix relevant and irrelevant information, which can limit QA performance. To address this, we introduce a context filtering approach that removes non-essential details and summarizes the crucial content through Reward Modeling. The method emphasizes keeping vital information while omitting the extraneous during summarization model training. We offer a framework for developing efficient QA models by discerning useful information from dataset pairs, bypassing the need for costly human evaluation. Furthermore, we show that our approach significantly outperforms the baseline, as evidenced by a 6.8-fold increase in the EM Per Token (EPT) metric, which we propose as a measure of token efficiency; this indicates a notable efficiency gain for low-resource settings.
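To make the idea of a token-efficiency metric concrete, here is a minimal sketch. It assumes EPT is simply the exact-match score (in percent) divided by the average number of context tokens passed to the QA model. The paper's exact formula is not reproduced in this section, so the ratio, the helper names `exact_match` and `em_per_token`, and the simple normalization are all illustrative assumptions.

```python
from typing import Sequence


def exact_match(prediction: str, answer: str) -> bool:
    # Assumed normalization: trim whitespace and ignore case only.
    return prediction.strip().lower() == answer.strip().lower()


def em_per_token(predictions: Sequence[str],
                 answers: Sequence[str],
                 context_token_counts: Sequence[int]) -> float:
    """Assumed form of EPT: EM (percent) / mean context length in tokens."""
    if not (len(predictions) == len(answers) == len(context_token_counts)):
        raise ValueError("all inputs must have the same length")
    if not predictions:
        raise ValueError("inputs must be non-empty")
    n = len(predictions)
    matches = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    em_percent = 100.0 * matches / n
    avg_tokens = sum(context_token_counts) / n
    return em_percent / avg_tokens


# Hypothetical toy data: one of two answers is correct, contexts average 50 tokens.
score = em_per_token(["Paris", "rome"], ["paris", "Berlin"], [40, 60])
```

Under this assumed definition, a filter that shortens the context raises the score as long as EM drops proportionally less than the token count.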

Paper Structure

This paper contains 22 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: For an effective QA task, we conduct context filtering through the process of creating better summarization using a reward model. Simultaneously, we make it possible to discern which parts are helpful and which are filtered out by utilizing rewards extracted from the data.