Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset
Tilahun Abedissa Taffa, Debayan Banerjee, Yaregal Assabie, Ricardo Usbeck
TL;DR
The paper tackles the challenge of answering scholarly questions that require heterogeneous data by introducing Hybrid-SQuAD, a dataset of 10.5K question-answer pairs that combines KG facts from DBLP and SemOpenAlex with Wikipedia text. It presents a RAG-based baseline that fuses evidence from multiple sources and is evaluated by exact match and F-score, reaching an exact-match score of 69.65 on the Hybrid-SQuAD test set. Key contributions include a data collection and linking pipeline across the three sources, a QA generation framework, and an analysis of evidence traversal patterns across KG and text. The results suggest that retrieval-augmented approaches are valuable for cross-source scholarly QA, and the dataset provides a public benchmark to stimulate further research in this area.
Abstract
Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, so QA systems need to integrate information across them. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions that require both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a RAG-based baseline hybrid QA model, which achieves an exact match score of 69.65 on the Hybrid-SQuAD test set.
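To make the reported metric concrete, the following is a minimal sketch of how exact match and token-level F-score are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the abstract does not specify the normalization used for Hybrid-SQuAD, so the function names and normalization rules here are illustrative assumptions, not the authors' evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may use different rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(predictions, golds):
    """Return (exact match, F-score) as percentages over paired answer lists."""
    n = len(golds)
    em = 100.0 * sum(exact_match(p, g) for p, g in zip(predictions, golds)) / n
    f1 = 100.0 * sum(token_f1(p, g) for p, g in zip(predictions, golds)) / n
    return em, f1
```

Under this definition, a score such as 69.65 would mean that roughly seven in ten predicted answers match the gold answer exactly after normalization.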
