Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering
Arijit Ghosh Chowdhury, Aman Chadha
TL;DR
The paper addresses QA robustness under natural distribution shifts with a data-centric augmentation pipeline: in-the-wild LLMs generate contexts conditioned on SQuAD questions and then produce corresponding QA pairs. A RoBERTa-Base extractive QA model trained on real SQuAD data benefits from the augmented data, showing improved robustness on natural distribution-shift benchmarks (NewWiki, NYT, Reddit, Amazon). Mixing real and generated data yields the best balance between robustness and in-domain accuracy, and generating both contexts and questions is crucial for generalization, whereas context-only or question-only generation has limited or negative effects. The work demonstrates a scalable, practical approach to domain generalization in QA and informs data augmentation strategies for robust NLP systems; future work includes broader comparisons of QA-generation methods and scaling to larger models.
Abstract
Robustness in Natural Language Processing continues to be a pertinent issue: state-of-the-art models underperform under naturally shifted distributions. In the context of Question Answering, domain adaptation methods are a growing body of research. However, very little attention has been given to domain generalization under natural distribution shifts, where the target domain is unknown. With drastic improvements in the quality of, and access to, generative models, we answer the question: How do generated datasets influence the performance of QA models under natural distribution shifts? We perform experiments on four different datasets under varying amounts of distribution shift, and analyze how "in-the-wild" generation can help achieve domain generalization. We take a two-step generation approach, generating both contexts and QA pairs to augment existing datasets. Through our experiments, we demonstrate that augmenting reading comprehension datasets with generated data leads to better robustness to natural distribution shifts.
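The two-step generation approach can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `question ||| answer` output format, the substring filter that keeps only extractive answers, and the helper names `generate_context`, `generate_qa_pairs`, and `augment` are all assumptions made for illustration. The `llm` argument stands for any text-in, text-out callable; no particular model or API is implied.

```python
import random
from typing import Callable, Dict, List

# One extractive QA example: keys "context", "question", "answer".
Example = Dict[str, str]
# Hypothetical stand-in for any LLM: takes a prompt, returns generated text.
LLM = Callable[[str], str]


def generate_context(question: str, llm: LLM) -> str:
    """Step 1: generate a passage conditioned on an existing question."""
    # Assumed prompt wording; the source does not specify the prompt.
    prompt = f"Write a short passage that answers this question:\n{question}"
    return llm(prompt).strip()


def generate_qa_pairs(context: str, llm: LLM) -> List[Example]:
    """Step 2: generate QA pairs for a generated passage."""
    # Assumed output format: one 'question ||| answer' pair per line.
    prompt = (
        "Read the passage and write question-answer pairs, one per line, "
        "formatted as 'question ||| answer'. Each answer must be an exact "
        "span of the passage.\n\n" + context
    )
    pairs: List[Example] = []
    for line in llm(prompt).splitlines():
        if "|||" not in line:
            continue  # skip lines that do not follow the assumed format
        question, answer = (part.strip() for part in line.split("|||", 1))
        # Keep only answers that literally occur in the passage, since an
        # extractive QA model needs an answer span inside the context.
        if question and answer and answer in context:
            pairs.append(
                {"context": context, "question": question, "answer": answer}
            )
    return pairs


def augment(real: List[Example], llm: LLM, seed: int = 0) -> List[Example]:
    """Mix real examples with generated ones and shuffle the result."""
    generated: List[Example] = []
    for example in real:
        context = generate_context(example["question"], llm)
        generated.extend(generate_qa_pairs(context, llm))
    mixed = real + generated  # new list; the input list is left unchanged
    random.Random(seed).shuffle(mixed)
    return mixed
```

Calling `augment(squad_examples, llm)` with a list of real examples returns a shuffled training set that contains both real and generated data, which corresponds to the mixed setting the summary reports as the best balance between robustness and in-domain accuracy.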
