Integrating SPARQL and LLMs for Question Answering over Scholarly Data Sources

Fomubad Borista Fondi, Azanzi Jiomekong Fidel, Gaoussou Camara

TL;DR

This work addresses question answering over heterogeneous scholarly data sources (DBLP, SemOpenAlex, and Wikipedia-based texts) in the Scholarly Hybrid QALD challenge at ISWC 2024. It proposes a hybrid pipeline that first retrieves data with SPARQL queries, then applies a divide-and-conquer strategy to handle the different question types and sources, and finally uses a pre-trained extractive QA model (BERT-base-cased-SQuAD2) to extract answers to personal author questions. Evaluation on the provided training and test sets via Codalab indicates that combining SPARQL results with LLM-based predictions improves accuracy, particularly for complex, context-dependent author-related questions. The approach targets QA over large, interlinked scholarly knowledge graphs and text sources, with practical implications for scalable knowledge discovery and retrieval.
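The three stages named above (SPARQL retrieval, divide-and-conquer routing, extractive QA) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cue words, the function names, the use of `rdfs:label`, and the example URI are all hypothetical, and the SPARQL client, text fetcher, and QA model are passed in as plain callables so the sketch runs without network access or model weights.

```python
from typing import Callable, Dict, List

# Hypothetical cue words for "personal" author questions (not from the paper).
PERSONAL_CUES = ("born", "nationality", "alma mater", "award", "educated")


def classify_question(question: str) -> str:
    """Route a question to the 'text' branch (personal author facts)
    or the 'kg' branch (answerable from SPARQL results alone)."""
    q = question.lower()
    return "text" if any(cue in q for cue in PERSONAL_CUES) else "kg"


def build_author_label_query(author_uri: str) -> str:
    """Build a SPARQL query that looks up a label for an author URI.
    Using rdfs:label is an assumption about the target graph."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?name WHERE {{ <{author_uri}> rdfs:label ?name . }} LIMIT 1"
    )


def answer(
    question: str,
    author_uri: str,
    run_sparql: Callable[[str], List[Dict[str, str]]],
    fetch_text: Callable[[str], str],
    extract_answer: Callable[[str, str], str],
) -> str:
    """Stage 1: SPARQL lookup. Stage 2: route by question type.
    Stage 3: extractive QA over text, only for the 'text' branch."""
    rows = run_sparql(build_author_label_query(author_uri))
    name = rows[0]["name"] if rows else ""
    if classify_question(question) == "kg":
        return name
    context = fetch_text(name)  # e.g. a Wikipedia-based passage about the author
    return extract_answer(question, context)


if __name__ == "__main__":
    # Stub components standing in for a SPARQL endpoint, a text store, and a QA model.
    stub_sparql = lambda query: [{"name": "Ada Lovelace"}]
    stub_text = lambda name: f"{name} was born in London."
    stub_qa = lambda question, context: context.split(" in ")[-1].rstrip(".")
    print(answer("Where was the author born?", "https://example.org/a1",
                 stub_sparql, stub_text, stub_qa))
```

Injecting the three components as callables keeps the routing logic testable on its own; in a real run they would be replaced by an HTTP SPARQL client and a question-answering model.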

Abstract

The Scholarly Hybrid Question Answering over Linked Data (QALD) Challenge at the International Semantic Web Conference (ISWC) 2024 focuses on Question Answering (QA) over diverse scholarly sources: DBLP, SemOpenAlex, and Wikipedia-based texts. This paper describes a methodology that combines SPARQL queries, a divide-and-conquer strategy, and a pre-trained extractive question answering model. It starts with SPARQL queries to gather data, then applies divide and conquer to manage the various question types and sources, and uses the model to handle personal author questions. The approach, evaluated with Exact Match and F-score metrics, shows promise for improving QA accuracy and efficiency in scholarly contexts.
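For readers unfamiliar with the two metrics, the following is a minimal Python sketch of Exact Match and a token-level F-score in the style commonly used for SQuAD-like extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption; the challenge's official scorer may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The University of Oxford", "university of oxford"))
    print(f1_score("Oxford", "University of Oxford"))
```

Exact Match rewards only fully correct answers, while the F-score gives partial credit when a predicted span overlaps the gold answer, which matters for extractive models that return slightly shorter or longer spans.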

Paper Structure (12 sections, 1 equation, 8 figures)


Figures (8)

  • Figure 1: Methodology pipeline for the Scholarly Hybrid QALD Challenge.
  • Figure 2: Data processing and cleaning.
  • Figure 3: Example of the "CitedBy count" set creation.
  • Figure 4: Sample of how questions were approached.
  • Figure 5: SPARQL query to retrieve author name.
  • ...and 3 more figures