Integrating SPARQL and LLMs for Question Answering over Scholarly Data Sources

Fomubad Borista Fondi; Azanzi Jiomekong Fidel; Gaoussou Camara

TL;DR

This work addresses question answering over heterogeneous scholarly data sources (DBLP, SemOpenAlex, and Wikipedia-based texts) in the Scholarly Hybrid QALD challenge at ISWC 2024. It proposes a hybrid pipeline that first retrieves data with SPARQL queries, then applies a divide-and-conquer strategy to handle the different question types and sources, and finally uses a pre-trained extractive QA model (BERT-base-cased-SQuAD2) to extract answers. Evaluation on the provided training and test sets via Codalab indicates that combining SPARQL results with LLM-based predictions improves accuracy, particularly for complex, context-dependent questions about authors. The approach targets QA over large, interlinked scholarly knowledge graphs and text sources, with practical implications for scalable knowledge discovery and retrieval.

Abstract

The Scholarly Hybrid Question Answering over Linked Data (QALD) Challenge at the International Semantic Web Conference (ISWC) 2024 focuses on Question Answering (QA) over diverse scholarly sources: DBLP, SemOpenAlex, and Wikipedia-based texts. This paper describes a methodology that combines SPARQL queries, divide-and-conquer algorithms, and a pre-trained extractive question answering model. It first runs SPARQL queries to gather data, then applies divide and conquer to handle the various question types and sources, and finally uses the model to answer personal author questions. The approach, evaluated with the Exact Match and F-score metrics, shows promise for improving QA accuracy and efficiency in scholarly contexts.
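The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes SQuAD-style answer normalization and a token-level F-score; the challenge's exact scoring rules are not given here, so the helper names (`normalize`, `exact_match`, `token_f1`) and the normalization steps are assumptions rather than the organizers' evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the challenge may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that omits one token of a four-token reference scores 0 on Exact Match but still earns partial credit on the F-score, which is why the two metrics are usually reported together.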

Paper Structure (12 sections, 1 equation, 8 figures)

Introduction
Methodology
Data Processing and Query Execution
Divide and Conquer Approach
Data Retrieval and Aggregation
Large Language Model-Based Predictions
How questions were approached
Evaluation and Finalization
Experimentation Environment
Results and Discussion
Conclusion
Online Resources

Figures (8)

Figure 1: Methodology pipeline for the Scholarly Hybrid QALD Challenge.
Figure 2: Data processing and cleaning.
Figure 3: Example of the "CitedBy count" set creation.
Figure 4: Sample of how questions were approached.
Figure 5: SPARQL query to retrieve author name.
...and 3 more figures
