LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs

Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, Ana Claudia Sima

TL;DR

Translating natural language questions into accurate federated SPARQL queries over evolving bioinformatics knowledge graphs is difficult because of the scale of these graphs and the tendency of LLMs to hallucinate. The authors propose a Retrieval-Augmented Generation system that uses endpoint metadata (example queries, VoID descriptions, and ShEx schemas) together with a query-validation step to steer and correct generated queries. The work delivers an end-to-end architecture with embedding-based context retrieval, schema-guided validation, and open-source tooling to maintain endpoint metadata. The approach requires no model training and scales across bioinformatics KGs such as UniProt, Bgee, and OMA; an online demo is available at chat.expasy.org.
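
The embedding-based context retrieval mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the example questions are invented, and the `embed` function is a toy bag-of-words stand-in for a real embedding model, used only so that the snippet runs without external dependencies. The idea is to rank stored example questions by similarity to the user question and pass the best matches to the LLM as context.

```python
import math
import re
from collections import Counter

# Hypothetical example questions, each tagged with the KG it targets.
# In the described system, such examples come from endpoint metadata.
EXAMPLES = [
    {"question": "List human proteins associated with a disease", "kg": "UniProt"},
    {"question": "Which genes are expressed in the brain of the mouse", "kg": "Bgee"},
    {"question": "Find orthologs of a given gene across species", "kg": "OMA"},
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, examples, k=2):
    """Return the k examples most similar to the user question."""
    query_vec = embed(question)
    ranked = sorted(
        examples,
        key=lambda ex: cosine(query_vec, embed(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    for ex in retrieve("What genes are expressed in the mouse brain?", EXAMPLES, k=1):
        print(ex["kg"], "-", ex["question"])
```

A real deployment would replace `embed` with a dense embedding model and store the vectors in an index, but the ranking step stays the same.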

Abstract

We introduce a Retrieval-Augmented Generation (RAG) system that uses Large Language Models (LLMs) to translate user questions into accurate federated SPARQL queries over bioinformatics knowledge graphs (KGs). To enhance accuracy and reduce hallucinations in query generation, our system utilises metadata from the KGs, including query examples and schema information, and incorporates a validation step to correct generated queries. The system is available online at chat.expasy.org.
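
To make the validation step more concrete, here is a minimal sketch of one way a generated query could be checked against schema information. It rests on several assumptions: the `SCHEMA` dictionary is a hypothetical, simplified summary of which predicates a class may use, `up:hasGene` is an invented predicate standing in for a hallucination, and the regular expression only handles simple `?subject predicate object .` patterns. The paper's actual validation logic may differ.

```python
import re

# Hypothetical, simplified schema summary: class -> predicates allowed on it.
# In the described system this kind of information is derived from KG metadata.
SCHEMA = {
    "up:Protein": {"up:organism", "up:encodedBy", "up:annotation"},
}

# Matches simple triple patterns of the form "?var predicate object ."
TRIPLE = re.compile(r"(\?\w+)\s+(a|\w+:\w+)\s+(\S+)\s*\.")

GENERATED_QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?gene WHERE {
  ?protein a up:Protein .
  ?protein up:organism ?taxon .
  ?protein up:hasGene ?gene .
}
"""


def validate(query, schema):
    """Return one message per predicate that the schema does not allow."""
    triples = [m.groups() for m in TRIPLE.finditer(query)]
    # Record the class of each variable declared with "?var a Class".
    var_class = {subj: obj for subj, pred, obj in triples if pred == "a"}
    errors = []
    for subj, pred, _obj in triples:
        cls = var_class.get(subj)
        if pred == "a" or cls not in schema:
            continue  # nothing to check for untyped or unknown classes
        if pred not in schema[cls]:
            allowed = ", ".join(sorted(schema[cls]))
            errors.append(
                f"{pred} is not a known predicate of {cls}; known predicates: {allowed}"
            )
    return errors


if __name__ == "__main__":
    for message in validate(GENERATED_QUERY, SCHEMA):
        print(message)
```

The returned messages could be appended to a follow-up prompt so that the LLM has a chance to repair the query, which is the kind of correction loop the abstract describes.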

Paper Structure

This paper contains 10 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: LLM-based SPARQL Query Generator System Architecture.