How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation

Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber

TL;DR

The paper investigates how the rise of powerful large language models (LLMs) affects dynamic adversarial question generation, a form of dynamic adversarial data collection (DADC). It introduces an LLM-enabled writing interface with retrieval-model guidance to help authors diagnose why their questions do or do not stump a model, and it formalizes a novel Item Response Theory–based metric to evaluate and incentivize high-quality adversarial questions. Through empirical studies, it shows that LLMs can both hinder and help question authors. Retrieval evidence and trivia-norm constraints improve question quality, while some LLM-driven tactics lead to vague or less effective questions. Retrieval evidence, particularly from dense passage retrieval, enhances authors' ability to stump LLMs like ChatGPT. The work advances quantitative evaluation of adversarial QA, offers a practical interface and dataset, and points to future directions for calibrating and evolving QA models in the presence of pervasive LLMs.

Abstract

Dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. However, the advent of large language models (LLMs) has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to defeat. To understand how these models impact the adversarial question writing process, we enrich the writing guidance with LLMs and retrieval models so that authors can reason about why their questions are not adversarial. While authors could create interesting, challenging adversarial questions, they sometimes resort to tricks that result in poor questions that are ambiguous, subjective, or confusing not just to a computer but also to humans. To address these issues, we propose new metrics and incentives for eliciting good, challenging questions and present a new dataset of adversarially authored questions.

Paper Structure

This paper contains 36 sections, 7 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Our IRT analysis exposes what makes for good and poor adversarial questions. The poor questions, which had low discriminability but high difficulty (blue), lack specificity despite being adversarial against ChatGPT (e.g., There are many songs in 1987 containing a line that rhymes with "ketchup"). The questions that had both high discriminability and high difficulty (green) met the criterion of being good, adversarial questions.
  • Figure 2: With "Apple Inc" as the target answer to the question, the interface is updated with answers from retrieval models, shown with the most relevant sentence, and from LLMs (e.g., DistilBERT, T5). The highlights are updated by the input perturbation technique, and the diversity widget is updated with the country representation of the questions and suggested countries.
  • Figure 3: The number of questions that stump only machines (top left) was comparable with the number of questions that stump both humans and machines (top right).
  • Figure 4: The adversarial techniques Temporal Misalignment, Composing Seen Clues, Domain Expert Knowledge, and Novel Clues are used more frequently in questions with high discriminability.
  • Figure 5: The adversarial techniques Location Alignment, Multistep Reasoning, Domain Expert Knowledge, and Logic & Calculation are used less in questions with high discriminability.
  • ...and 2 more figures
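
The Figure 1 caption distinguishes questions by IRT difficulty and discriminability. As a rough illustration only, the sketch below uses the standard two-parameter logistic (2PL) item response function, which is an assumption here since this summary does not state the paper's exact formulation. The function names and all parameter values are hypothetical.

```python
import math


def p_correct(skill, difficulty, discriminability):
    """Standard 2PL item response function (assumed, not taken from the paper).

    Returns the modeled probability that an answerer with the given skill
    answers a question with the given difficulty and discriminability.
    """
    return 1.0 / (1.0 + math.exp(-discriminability * (skill - difficulty)))


# Hypothetical values, chosen only for illustration.
WEAK_SKILL, STRONG_SKILL = -1.0, 3.0
hard_discriminating = {"difficulty": 1.0, "discriminability": 2.0}
hard_flat = {"difficulty": 1.0, "discriminability": 0.1}


def skill_gap(item):
    """How much more likely the strong answerer is to be correct than the weak one."""
    return p_correct(STRONG_SKILL, **item) - p_correct(WEAK_SKILL, **item)


if __name__ == "__main__":
    print("high discriminability gap:", round(skill_gap(hard_discriminating), 3))
    print("low discriminability gap: ", round(skill_gap(hard_flat), 3))
```

Under this assumed model, both questions are equally difficult, but only the high-discriminability one separates strong answerers from weak ones. With low discriminability, nearly everyone has close to a coin-flip chance regardless of skill, which is consistent with the caption's point that difficult but vague questions make poor adversarial examples.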