Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset

Tilahun Abedissa Taffa, Debayan Banerjee, Yaregal Assabie, Ricardo Usbeck

TL;DR

The paper addresses scholarly questions whose answers require heterogeneous data by introducing Hybrid-SQuAD, a dataset of 10.5K question-answer pairs that combines KG facts from DBLP and SemOpenAlex with Wikipedia text. It presents a RAG-based baseline that fuses evidence from multiple sources and is evaluated by exact match and F-score, reaching an exact match score of 69.65 on the Hybrid-SQuAD test set. Key contributions include a data collection and linking pipeline across the three sources, a QA generation framework, and an analysis of evidence traversal patterns across KG and text. The results support retrieval-augmented approaches for cross-source scholarly QA, and the dataset serves as a public benchmark for further research in this area.

Abstract

Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, necessitating the development of QA systems that integrate information from multiple heterogeneous data sources. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions incorporating both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a RAG-based baseline hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD test set.
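
For readers unfamiliar with the metrics mentioned above, the following is a minimal sketch of how exact match and token-level F-score are conventionally computed for extractive QA (SQuAD-style normalization). It is an assumption that Hybrid-SQuAD is scored this way; the function names and the sample strings are illustrative and do not come from the paper or its evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize_answer(prediction) == normalize_answer(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


if __name__ == "__main__":
    # Illustrative strings only, not taken from the dataset.
    print(exact_match("The Turing Award", "turing award"))
    print(exact_match_score(["Berlin", "2019"], ["berlin.", "2020"]))
```

Under this convention, a reported score such as 69.65 would mean that roughly 70 of every 100 test predictions equal the gold answer after normalization.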

Paper Structure

This paper contains 20 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Sample scholarly information from heterogeneous data sources: DBLP, SemOpenAlex, and Wikipedia, along with a question and answer pair in Hybrid-SQuAD.
  • Figure 2: Baseline Model
  • Figure 3: First three words distributions in questions.