ExpertGenQA: Open-ended QA generation in Specialized Domains

Haz Sameen Shahgir, Chansong Lim, Jia Chen, Evangelos E. Papalexakis, Yue Dong

TL;DR

ExpertGenQA introduces a domain-focused QA generation protocol that blends few-shot learning with dual style-topic categorization to produce diverse, topic-complete QA pairs grounded in expert-written exemplars. Evaluated on FRA regulatory documents, the approach doubles generation efficiency and attains $94.4\%$ topic coverage, while producing questions that better match expert cognitive requirements as per Bloom's taxonomy. When used to train a retrieval model, ExpertGenQA data improve top-1 accuracy by $13.02\%$ over strong baselines, demonstrating practical gains for technical information retrieval. The study also reveals biases in current LLM-based judges and reward models toward superficial writing, underscoring the need for retrieval-grounded evaluation in domain-specific QA tasks. Overall, ExpertGenQA offers a scalable path to high-quality, domain-relevant QA generation with measurable downstream benefits, while acknowledging domain specificity and computation costs as directions for future work.

Abstract

Generating high-quality question-answer pairs for specialized technical domains remains challenging, with existing approaches facing a tradeoff between leveraging expert examples and achieving topical diversity. We present ExpertGenQA, a protocol that combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs. Using U.S. Federal Railroad Administration documents as a test bed, we demonstrate that ExpertGenQA achieves twice the efficiency of baseline few-shot approaches while maintaining $94.4\%$ topic coverage. Through systematic evaluation, we show that current LLM-based judges and reward models exhibit strong bias toward superficial writing styles rather than content quality. Our analysis using Bloom's Taxonomy reveals that ExpertGenQA better preserves the cognitive complexity distribution of expert-written questions compared to template-based approaches. When used to train retrieval models, our generated queries improve top-1 accuracy by $13.02\%$ over baseline performance, demonstrating their effectiveness for downstream applications in technical domains.

Paper Structure

This paper contains 39 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the ExpertGenQA pipeline (left) and proposed evaluation metrics (right). Green checkmarks (✓) indicate interpretable metrics that correlate with improved retrieval accuracy, our primary evaluation metric. The red cross (✗) indicates our finding that both Reward Models and LLM-as-Judge show bias toward superfluous writing style and lack correlation with retrieval accuracy.
  • Figure 2: Comparison of efficiency across question-generation pipelines for different numbers of few-shot examples. We define efficiency as the fraction of unique generations over the total sampled generations.
  • Figure 3: Box plot of rewards assigned to questions by the Llama-3.1-Nemotron-70B-Instruct Reward Model. Notably, merely rephrasing synthetic questions to sound human-like drastically increases the assigned reward score even though the semantic content is unchanged.
  • Figure 4: Distribution of cognitive complexity levels in human-written and synthetic instructions according to Bloom's Revised Taxonomy. MDCure shows higher concentration in lower cognitive levels.
  • Figure 5: Box plot of scores assigned by GPT4o-as-Judge using the MDCure prompt (Liu et al., 2024). GPT4o-as-Judge assigned similar scores for all generation methods and hence does not correlate with the clear differences in downstream task improvements shown in the retrieval results table (\ref{tab:retreival}).
  • ...and 1 more figure
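
The Figure 2 caption defines efficiency as the fraction of unique generations over the total sampled generations. A minimal sketch of that ratio follows. How the paper decides that two generations are duplicates is not stated in this summary, so the sketch assumes exact matching after lowercasing and whitespace normalization; the names `normalize` and `generation_efficiency` and the sample questions are hypothetical and used only for illustration.

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal.

    Assumption: this is a stand-in for whatever duplicate criterion the paper uses.
    """
    return " ".join(question.lower().split())


def generation_efficiency(generations: list[str]) -> float:
    """Return the fraction of unique generations over the total number sampled."""
    if not generations:
        return 0.0
    unique = {normalize(q) for q in generations}
    return len(unique) / len(generations)


# Made-up sample: the first two questions differ only in case and spacing.
sampled = [
    "What is the maximum allowable track gauge?",
    "what is the maximum  allowable track gauge?",
    "Who must sign the inspection report?",
    "When must a defective rail be replaced?",
]
print(generation_efficiency(sampled))  # 3 unique out of 4 sampled -> 0.75
```

Under this definition, a pipeline that keeps producing new questions as more samples are drawn stays close to 1.0, while one that repeats itself drifts toward 0. A stricter or looser duplicate test (for example, embedding similarity) would change the absolute values but not the form of the ratio.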