Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction

Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz

TL;DR

This work designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments.

Abstract

Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, which uses structured knowledge graphs (KGs), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and labor-intensive, and modern assisting tools such as large language models (LLMs) are not used to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.

Paper Structure

This paper contains 51 sections, 1 equation, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed construction pipeline for natural questions. The figure shows the processing of a single example. Rounded rectangles represent acquired data, with blue text indicating a hyperlink to another Wikipedia article. Arrow descriptions indicate automated procedures. The people symbol denotes a step involving human verification, described in the Human Verification section and in Figure 2. The example data is in English for non-Polish readers, but the pipeline was originally executed on Polish data for PUGG creation.
  • Figure 2: The human verification procedure for all acquired candidates.
  • Figure 3: Overview of the proposed construction pipeline for template-based questions. The figure shows the processing of a single example. The symbol of people denotes a step involving human verification to ensure all questions are meaningful. The example data is in English for non-Polish readers, but the pipeline was originally executed on Polish data for PUGG creation.