Robust Training for Conversational Question Answering Models with Reinforced Reformulation Generation

Magdalena Kaiser, Rishiraj Saha Roy, Gerhard Weikum

TL;DR

The paper tackles robustness in conversational QA over knowledge graphs, targeting the surface-form brittleness of models trained solely on gold QA pairs. It introduces REIGN, a framework that combines a Reformulation Category Selector (RCS), trained with Deep Q-Networks, and a Reformulation Generator (RG), fine-tuned on distantly supervised data, to produce intent-preserving question reformulations that augment QA training. A ConvQA model is then trained on both original and reformulated pairs, and GPT-generated reformulations are used to stress-test robustness. Experiments show consistent improvements over gold-only baselines across two ConvQA benchmarks and their GPT-augmented test sets, highlighting the value of model-aware data augmentation. The work also provides extensive ablations, demonstrates zero-shot transfer to a second benchmark, and releases the reformulation resources.

Abstract

Models for conversational question answering (ConvQA) over knowledge graphs (KGs) are usually trained and tested on benchmarks of gold QA pairs. This implies that training is limited to surface forms seen in the respective datasets, and evaluation is on a small set of held-out questions. Through our proposed framework REIGN, we take several steps to remedy this restricted learning setup. First, we systematically generate reformulations of training questions to increase robustness of models to surface form variations. This is a particularly challenging problem, given the incomplete nature of such questions. Second, we guide ConvQA models towards higher performance by feeding them only those reformulations that help improve their answering quality, using deep reinforcement learning. Third, we demonstrate the viability of training major model components on one benchmark and applying them zero-shot to another. Finally, for a rigorous evaluation of robustness for trained models, we use and release large numbers of diverse reformulations generated by prompting GPT for benchmark test sets (resulting in a 20x increase in size). Our findings show that ConvQA models with robust training via reformulations significantly outperform those with standard training from gold QA pairs only.
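
To make the idea of performance-guided category selection more concrete, the following is a minimal sketch, not the authors' implementation. It swaps the Deep Q-Network for a stateless, tabular epsilon-greedy value update, so the selector ignores the question and conversation context that a real RCS would condition on. The category labels, `reward_fn`, and all function names are hypothetical; `reward_fn` stands in for whatever signal measures whether a reformulation of a given category helped the QA model.

```python
import random

# Hypothetical category labels for illustration only; not the paper's taxonomy.
CATEGORIES = ["INS ent", "DEL ent", "SUBS ent", "SUBS rel"]


def select_category(q_values, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else pick the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def update(q_values, action, reward, lr=0.1):
    """Move the value estimate of the chosen category toward the observed reward.

    This is a bandit-style update with no next-state term, a simplification
    of the Q-learning target that a DQN would use.
    """
    q_values[action] += lr * (reward - q_values[action])


def train_selector(reward_fn, episodes=500, epsilon=0.2, seed=0):
    """Learn one value per category from repeated (category, reward) trials."""
    rng = random.Random(seed)
    q_values = [0.0] * len(CATEGORIES)
    for _ in range(episodes):
        action = select_category(q_values, epsilon, rng)
        # Assumption: reward_fn reports how much this category helped the QA model.
        update(q_values, action, reward_fn(CATEGORIES[action]))
    return q_values
```

As a usage example, calling `train_selector(lambda c: 1.0 if c == "SUBS rel" else 0.0)` with this toy reward leaves the estimate for `"SUBS rel"` as the largest value, so a greedy selector would then favor that category. In the framework described above, the chosen category would be passed to the Reformulation Generator, and only the resulting reformulations would be added to the QA training data.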

Paper Structure (19 sections, 5 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Performance-guided reformulation generation in REIGN, illustrated through our running example conversation.
  • Figure 2: Workflow of REIGN: RCS is trained by reinforcement learning, and RG by supervised learning.
  • Figure 3: Taxonomy of reformulation categories. Legend: part = question-part; INS = Insert, DEL = Delete, SUBS = Substitute; ent = entity mention, rel = relation, ent-type = entity type mention, ans-type = answer type mention; w/ = with.
  • Figure 4: Common category predictions by the RCS DQN.