Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance
Felix Brei, Johannes Frey, Lars-Peter Meyer
TL;DR
This study investigates whether language models with fewer than one billion parameters can be fine-tuned to translate natural language questions into SPARQL queries, enabling offline AI-assisted querying on commodity hardware. It evaluates models from the T5, BART, M2M100, and NLLB-200 families on three datasets (Organizational Graph, a CoyPu subset, and QALD10) and finds that BART and M2M100 variants can learn NL-to-SPARQL mappings to varying degrees, while T5 variants underperform on these tasks. Results depend strongly on the dataset and expose issues such as entity linking and prefix requirements, which point to prerequisites for effective fine-tuning and data generation. The work has practical implications for resilient, on-device semantic web tools and outlines future directions: integration with retrieval-augmented generation, open-source data generation, and model optimization for on-device deployment.
Abstract
In this work we show that language models with fewer than one billion parameters can be used to translate natural language to SPARQL queries after fine-tuning. Using three different datasets ranging from academic to real-world, we identify prerequisites that the training data must fulfill for the training to be successful. The goal is to empower users of semantic web technology to use AI assistance on affordable commodity hardware, making them more resilient against external factors.
