Assessing SPARQL capabilities of Large Language Models

Lars-Peter Meyer, Johannes Frey, Felix Brei, Natanael Arndt

TL;DR

This work tackles the problem of evaluating out-of-the-box SPARQL SELECT capabilities of Large Language Models when interfaced with Knowledge Graphs. It introduces the LLM-KG-Bench framework with four task types (SSF, T2S, S2A, T2A) to probe syntax, semantics, and KG prompt influence across multiple models, including GPT, Gemini, and Claude. The findings show that while LLMs generally handle SPARQL syntax, generating semantically correct queries remains challenging and highly dependent on KG representation and prompting. The authors provide a reproducible benchmarking pipeline and public data, highlighting the need for diverse evaluation datasets and suggesting avenues for improvement through KG-aware prompting and potential fine-tuning.

Abstract

The integration of Large Language Models (LLMs) with Knowledge Graphs (KGs) offers significant synergistic potential for knowledge-driven applications. One possible integration is the interpretation and generation of formal languages, such as those used in the Semantic Web, with SPARQL being a core technology for accessing KGs. In this paper, we focus on measuring the out-of-the-box capabilities of LLMs to work with SPARQL, and more specifically with SPARQL SELECT queries, applying a quantitative approach. We implemented various benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation with several LLMs. The tasks assess capabilities along the dimensions of syntax, semantic read, semantic create, and the role of knowledge graph prompt inclusion. With these new benchmarking tasks, we evaluated a selection of GPT, Gemini, and Claude models. Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs and heavily depends on the specific LLM as well as the complexity of the task. While fixing basic syntax errors seems to pose no problems for the best of the current LLMs evaluated, creating semantically correct SPARQL SELECT queries is difficult in several cases.

Paper Structure (13 sections, 2 figures)