Alexpaca: Learning Factual Clarification Question Generation Without Examples

Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano

TL;DR

Alexpaca tackles the problem of generating factual clarifying questions when the initial context is incomplete. It introduces HotpotQA-FLM, a self-supervised benchmark for pragmatic asking of clarifying questions (ACQ) in a multi-hop QA setting. The method trains a clarifying-question model via rejection sampling on its own interactions with an answering agent, enabling fine-tuning without manual annotations. Results show that humans outperform zero-shot models, while Alexpaca improves information recovery by about 28% over a baseline Llama 3 8B Instruct and surpasses GPT-3.5 Turbo in all reported metrics. The work demonstrates a scalable, private, and cost-effective approach to factual ACQ with potential benefits for high-stakes AI assistants and safety-critical applications.

Abstract

Real-life tasks such as giving legal or technical advice often lack complete context at the outset and can have disparate answers depending thereon. The ability to derive missing factual information by asking clarifying questions (ACQ) is an important element of real-life collaboration on such reasoning tasks. Existing factual clarification question challenges evaluate generations based on word overlap or human evaluations. Recent work explores generating a response to the clarifying question and then evaluating its utility directly. So far, these tasks are limited to disambiguating the user's intent rather than concrete facts about the situation. The factual domain presents unique challenges since responses to clarification questions must be factually true for accurate evaluation. To enable evaluation of factual domain clarification question generation, we present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. The task, HotpotQA-FLM, can be evaluated automatically, making it convenient for benchmarking language models. We observe that humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics. Finally, we find that by fine-tuning Llama 3 8B Instruct on its own generations, filtered via rejection sampling, we can improve information recovery by 27.6 percent.
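
The rejection-sampling step mentioned at the end of the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rejection_sample`, `propose`, `gain`, `samples_per_context`, and `min_gain` are hypothetical, and the toy `propose` and `gain` functions merely stand in for the ACQ model and for the change in downstream performance after the answering agent responds.

```python
from typing import Callable, Iterable, List, Tuple

# One training pair: (incomplete context, clarifying question).
Example = Tuple[str, str]


def rejection_sample(
    contexts: Iterable[str],
    propose: Callable[[str, int], str],
    gain: Callable[[str, str], float],
    samples_per_context: int = 4,
    min_gain: float = 0.0,
) -> List[Example]:
    """Keep only the sampled questions whose measured gain exceeds min_gain.

    `propose` stands in for sampling a question from the ACQ model, and
    `gain` stands in for the downstream improvement obtained after the
    answering agent responds. Both are assumptions made for illustration.
    """
    kept: List[Example] = []
    for context in contexts:
        for k in range(samples_per_context):
            question = propose(context, k)
            if gain(context, question) > min_gain:
                kept.append((context, question))
    return kept


# Toy stand-ins so the sketch runs without any model.
CANDIDATES = [
    "Who directed the film?",
    "What is the weather?",
    "When was the film released?",
]


def toy_propose(context: str, k: int) -> str:
    # Deterministic "sampling": cycle through a fixed candidate list.
    return CANDIDATES[k % len(CANDIDATES)]


def toy_gain(context: str, question: str) -> float:
    # Pretend that only questions about the film help the downstream task.
    return 1.0 if "film" in question else 0.0


sft_data = rejection_sample(
    ["Where was the director of the film born?"],
    toy_propose,
    toy_gain,
    samples_per_context=3,
)
```

In this sketch the surviving pairs in `sft_data` would serve as the fine-tuning set for the next round; the actual filtering criterion and number of samples used in the paper are not given in this summary.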

Paper Structure (29 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the HotpotQA-FLM task, which simulates the need to formulate a question. Conventionally, the downstream model performs the downstream task directly (*B). However, in HotpotQA-FLM (*B), critical information is missing ①. To acquire that information, the ACQ model ② first uses the context to generate a clarification question. The question is presented to the contextually knowledgeable answering agent ③, which generates a response. The response is sent as additional context to the downstream model ④. For strong ACQ models, we expect the downstream model to achieve better performance on context + answering agent response than on context alone.
  • Figure 2: An example containing a downstream task $t$, supporting facts $f^{sup}_{1,...,n}$, and distractor facts $f^{dis}_{1,...,n}$. (Additional facts not shown.) We create an incomplete example $x^i$ by masking one supporting fact, $f^*$, chosen at random, from the facts in the complete example $x^c$. Prompted with $x^i$, the ACQ model poses a question to the answering agent, which returns one response $f_r$ drawn from the supporting or distractor facts. We then form $x^r = x^i + f_r$ by appending the response to the incomplete example, which we expect to improve downstream model performance $D(\cdot)$.
  • Figure 3: F1 and exact match recovery for PACQ models and human annotators. Results are shown for the full validation set ($n=7404$) and the test set ($n=400$), which contains human-generated ACQ questions. Alexpaca-1r indicates single-round rejection sampling.
  • Figure 4: Proportion of questions (Q) answered with a masked fact (MS) vs. distractor (D) by answering agent (left section). Proportion of answers given resulting in positive, zero, or negative difference in downstream model performance (right section).
  • Figure 5: Supporting, answered, and masked F1 as a function of downstream model architecture.
  • ...and 1 more figure
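
The example construction described in the Figure 2 caption can be sketched as below. This is a hedged illustration under stated assumptions: the `Example` class, the function names, and the toy facts are all hypothetical, and the word-overlap answering agent is a simple stand-in, not the agent used in the paper. The downstream model $D(\cdot)$ is left out entirely.

```python
import random
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    task: str                # downstream task t
    facts: Tuple[str, ...]   # facts visible to the models


def mask_supporting_fact(
    task: str,
    supporting: Sequence[str],
    distractors: Sequence[str],
    rng: random.Random,
) -> Tuple[Example, str]:
    """Build the incomplete example x^i by hiding one supporting fact f*."""
    masked = rng.choice(list(supporting))
    visible = [f for f in supporting if f != masked] + list(distractors)
    return Example(task, tuple(visible)), masked


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def answering_agent(question: str, all_facts: Sequence[str]) -> str:
    """Toy agent (an assumption): return the fact with most word overlap."""
    q = tokens(question)
    return max(all_facts, key=lambda fact: len(q & tokens(fact)))


def add_response(incomplete: Example, response: str) -> Example:
    """Form x^r = x^i + f_r by appending the agent's response."""
    return Example(incomplete.task, incomplete.facts + (response,))


# Toy data, invented purely for illustration.
TASK = "Where was the director of Film X born?"
SUPPORTING = ["Film X was directed by Jane Doe.", "Jane Doe was born in Oslo."]
DISTRACTORS = ["Film Y was released in 1999."]

x_i, f_star = mask_supporting_fact(TASK, SUPPORTING, DISTRACTORS, random.Random(0))
f_r = answering_agent("Where was Jane Doe born?", SUPPORTING + DISTRACTORS)
x_r = add_response(x_i, f_r)
```

A full evaluation would then compare the downstream model's score on `x_r` against its score on `x_i`; how that comparison is normalized into the reported recovery numbers is not specified in this summary.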