Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

Sizhe Yuen, Ting Su, Ziyang Wang, Yali Du, Adam J. Sobey

TL;DR

The paper tackles the challenge of knowledge-intensive QA by proposing an automated pipeline that generates context-based QA pairs from external documents to fine-tune large language models. It introduces a two-stage framework with an Automated QA Generator and a Model Fine-Tuner, and leverages a self-improving cycle to reduce human labelling. Extensive evaluation on the TechQA dataset shows that training with original data under context yields the best results, while synthetic, context-free data can still improve performance in low-context settings. The work demonstrates that synthetic QA data can augment fine-tuning, enabling more scalable domain adaptation, though the strongest performance still relies on context-aware supervision.

Abstract

A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques over a data source, such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Comprehensive experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems. Mistral-7b-v0.3 outperforms Llama-3-8b, with BERT F1, BLEU, and ROUGE scores of 0.858, 0.172, and 0.260 for the LLM-generated QA pairs, compared to scores of 0.836, 0.083, and 0.139 for the human-annotated QA pairs.
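
To make two of the evaluation metrics named in the abstract concrete, the following is a minimal sketch of a ROUGE-1-style unigram F1 and of perplexity. It is illustrative only and is not the paper's evaluation code: the function names `unigram_f1` and `perplexity` are hypothetical, whitespace tokenisation with lowercasing is an assumption, and the example strings are invented. Published scores would normally come from established metric implementations rather than a hand-rolled version like this.

```python
import math
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Invented strings, purely for illustration.
    print(unigram_f1("restart the server now", "restart the server"))
    # A model that assigns probability 0.5 to every token.
    print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity means the model finds the reference text less surprising, while a higher overlap F1 means the generated answer shares more words with the reference answer. BLEU and BERTScore follow the same candidate-versus-reference pattern but use n-gram precision and contextual embedding similarity respectively.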

Paper Structure

This paper contains 14 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Example generation procedure of QA pairs with the TechQA dataset.
  • Figure 2: F1 score when answering with the context document provided.
  • Figure 3: F1 score when answering with no context provided.