Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

William Thorne, Ambrose Robinson, Bohua Peng, Chenghua Lin, Diana Maynard

TL;DR

The paper tackles the need for challenging, domain-specific MRC datasets in the cultural heritage sector to evaluate RAG-based QA systems. It introduces a cost-efficient RLHF pipeline that uses synthetic preference data derived from multiple QA models on SQuAD to learn a reward model, which then guides PPO-based question generation with adapters. Key contributions include a complete methodology for increasing question difficulty, empirical evidence from both automated metrics and human evaluation, an in-depth error analysis, and an open-source codebase with adapters for reproducibility. The approach enables practitioners to generate harder, domain-relevant evaluation data without costly manual curation and can be adapted to other domains with similar constraints.

Abstract

As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it's equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method's effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
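
The difficulty assumption in the abstract (harder questions are answered correctly less often) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `difficulty` and `preference_pairs`, the use of the failure rate as the score, and the rule of skipping ties are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def difficulty(correct_flags: List[bool]) -> float:
    """Fraction of QA models that answered the question incorrectly.

    Hypothetical scoring rule: 0.0 means every model was right,
    1.0 means every model was wrong.
    """
    if not correct_flags:
        raise ValueError("need at least one QA model result")
    return 1.0 - sum(correct_flags) / len(correct_flags)


def preference_pairs(results: Dict[str, List[bool]]) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs where `chosen` is strictly harder.

    `results` maps a question to one correctness flag per QA model.
    """
    scores = {question: difficulty(flags) for question, flags in results.items()}
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] > scores[b]:
            pairs.append((a, b))
        elif scores[b] > scores[a]:
            pairs.append((b, a))
        # Equal scores carry no preference signal, so the pair is skipped.
    return pairs
```

Pairs of this shape are the usual input to a pairwise reward model, which can then score generated questions during PPO fine-tuning. How the paper selects which questions to compare is not stated in this section, so the all-pairs comparison here is only a placeholder.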

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Example generated questions from a supervised fine-tuned (SFT) question generation model and from one fine-tuned with PPO on synthetic difficulty samples.
  • Figure 2: Depiction of our dataset generation pipeline. Question-Answering models are first used to create pairwise comparison data to train a reward model. An SFT model is trained on the train split of SQuAD and then fine-tuned using the reward model, producing the RL model. When generating question-answer pairs for the final dataset, generations are passed through the format critics to ensure data quality.
  • Figure 3: Example training sample from the reformatted SQuAD dataset for use in supervised fine-tuning.
  • Figure 4: Distribution of reference-free metric results for each model's generations on our SQuAD test set.
  • Figure 5: Error distribution of questions for SFT, ZeroShot, and the two best-performing PPO variants.
  • ...and 2 more figures