Integrating SPARQL and LLMs for Question Answering over Scholarly Data Sources
Fomubad Borista Fondi, Azanzi Jiomekong Fidel, Gaoussou Camara
TL;DR
This work tackles question answering over heterogeneous scholarly data sources (DBLP, SemOpenAlex, and Wikipedia-based texts) within the ISWC 2024 QALD challenge. It proposes a hybrid pipeline that first retrieves data with SPARQL queries, then applies a divide-and-conquer strategy to handle the different question types, and finally uses a pre-trained extractive QA model (BERT-base-cased-SQuAD2) to extract and refine answers. Evaluation on the provided training/test sets via CodaLab indicates that combining SPARQL results with the model's predictions improves accuracy, particularly for complex, context-dependent author-related queries. The approach contributes to robust QA over large, interlinked scholarly knowledge graphs and text sources, with practical implications for scalable knowledge discovery and retrieval.
Abstract
The Scholarly Hybrid Question Answering over Linked Data (QALD) Challenge at the International Semantic Web Conference (ISWC) 2024 focuses on Question Answering (QA) over diverse scholarly sources: DBLP, SemOpenAlex, and Wikipedia-based texts. This paper describes a methodology that combines SPARQL queries, a divide-and-conquer strategy, and a pre-trained extractive question answering model. The pipeline first gathers data with SPARQL queries, then uses divide and conquer to split the work by question type and source, and finally applies the model to personal questions about authors. The approach, evaluated with Exact Match and F-score metrics, shows promise for improving QA accuracy and efficiency in scholarly contexts.
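For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how Exact Match and a token-level F-score are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the challenge's actual scoring script may differ. The function names `normalize`, `exact_match`, and `f1_score` are illustrative and do not come from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily the challenge's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With made-up answer strings, `exact_match("The Semantic Web", "semantic web")` counts as a match because normalization removes the article and case, while `f1_score("semantic web", "semantic web conference")` gives partial credit for the two shared tokens.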
