Automatic Question-Answer Generation for Long-Tail Knowledge

Rohan Kumar, Youngmin Kim, Sunitha Ravi, Haitian Sun, Christos Faloutsos, Ruslan Salakhutdinov, Minji Yoon

TL;DR

This paper proposes an automatic approach for generating specialized QA datasets for tail entities, presents the associated research challenges, and reports extensive experiments with pretrained LLMs on the newly generated long-tail QA datasets.

Abstract

Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions related to common knowledge, LLMs encounter difficulties in learning about uncommon long-tail knowledge (tail entities). Since manually constructing QA datasets demands substantial human resources, the types of existing QA datasets are limited, leaving us with a scarcity of datasets to study the performance of LLMs on tail entities. In this paper, we propose an automatic approach to generate specialized QA datasets for tail entities and present the associated research challenges. We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets, comparing their performance with and without external resources including Wikipedia and Wikidata knowledge graphs.

Paper Structure (18 sections, 3 figures, 5 tables)

This paper contains 18 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the automatic QA data construction process for long-tail knowledge: we first sample low-degree tail entities and extract their connected triplets from the Wikidata knowledge graph; we then prompt GPT-3 with the triplets to generate natural language questions.
  • Figure 2: Node degree distribution of all entities in Wikidata.
  • Figure 3: Density of properties per the number of possible s2 object entities before (Top) and after (Bottom) the difficulty controlling.
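
The construction process summarized in the Figure 1 caption (sample low-degree entities, collect their triplets, turn each triplet into a prompt) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the toy triplets, the `max_degree` cutoff, the function names, and the prompt wording are all assumptions, and the call to a language model is left out.

```python
from collections import Counter

# Toy (subject, property, object) triplets standing in for Wikidata.
# All identifiers here are made up for illustration.
TRIPLETS = [
    ("Q_popular", "country", "Q_a"),
    ("Q_popular", "occupation", "Q_b"),
    ("Q_popular", "award", "Q_c"),
    ("Q_tail", "birthplace", "Q_d"),
]


def node_degrees(triplets):
    """Count how many triplets each entity appears in, as subject or object."""
    degrees = Counter()
    for subj, _prop, obj in triplets:
        degrees[subj] += 1
        degrees[obj] += 1
    return degrees


def tail_triplets(triplets, max_degree):
    """Keep triplets whose subject has degree <= max_degree (assumed cutoff)."""
    degrees = node_degrees(triplets)
    return [t for t in triplets if degrees[t[0]] <= max_degree]


def build_prompt(triplet):
    """Format one triplet as a question-generation prompt (hypothetical wording)."""
    subj, prop, obj = triplet
    return (
        "Write a natural language question whose answer is the object.\n"
        f"Subject: {subj}\nProperty: {prop}\nObject: {obj}\nQuestion:"
    )


if __name__ == "__main__":
    for triplet in tail_triplets(TRIPLETS, max_degree=1):
        print(build_prompt(triplet))
```

In this sketch an entity's degree is simply the number of triplets it takes part in; the paper's exact degree definition and sampling thresholds may differ.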