Evaluating Large Language Models in Semantic Parsing for Conversational Question Answering over Knowledge Graphs

Phillip Schneider; Manuel Klettner; Kristiina Jokinen; Elena Simperl; Florian Matthes

TL;DR

The paper addresses the problem of converting multi-turn conversational questions into executable SPARQL over a knowledge graph, focusing on LLMs not explicitly trained for semantic parsing. It evaluates four LLMs under zero-shot and few-shot prompting, plus a LoRA-based fine-tuning regime, on the SPICE benchmark with both automatic metrics ($F1$, $ACC$, $EM$) and human judgments. Key findings show that while small, fine-tuned models (e.g., LoRA-7B-512) can achieve high exact-match on SPARQL queries, larger models like GPT-3.5-Turbo excel in zero-/few-shot settings, and common errors include off-prompt outputs and syntax issues, especially on complex questions. The study demonstrates the practical viability of end-to-end SPARQL generation from dialogues, offers guidance on prompting and fine-tuning to mitigate errors, and points to future work on additional query languages and multilingual benchmarks.
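As a rough illustration of the automatic metrics named above, the sketch below computes exact match between a generated and a reference SPARQL query, and an F1 score over the sets of answers the two queries return. The function names, the whitespace-only normalization, and the set-based F1 definition are assumptions made for this example; the paper's own evaluation scripts may differ.

```python
from typing import Set


def normalize_query(query: str) -> str:
    """Collapse runs of whitespace so formatting differences are ignored (assumed normalization)."""
    return " ".join(query.split())


def exact_match(predicted: str, reference: str) -> bool:
    """True if the two queries are identical after whitespace normalization."""
    return normalize_query(predicted) == normalize_query(reference)


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over two answer sets."""
    if not predicted and not gold:
        return 1.0  # both empty: treated as a perfect match in this sketch
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict: two queries that return the same answers but differ in variable names or triple order count as a miss, which is one reason answer-level metrics are reported alongside it.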

Abstract

Conversational question answering systems often rely on semantic parsing to enable interactive information retrieval, which involves the generation of structured database queries from a natural language input. For information-seeking conversations about facts stored within a knowledge graph, dialogue utterances are transformed into graph queries in a process that is called knowledge-based conversational question answering. This paper evaluates the performance of large language models that have not been explicitly pre-trained on this task. Through a series of experiments on an extensive benchmark dataset, we compare models of varying sizes with different prompting techniques and identify common issue types in the generated output. Our results demonstrate that large language models are capable of generating graph queries from dialogues, with significant improvements achievable through few-shot prompting and fine-tuning techniques, especially for smaller models that exhibit lower zero-shot performance.
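To make the difference between zero-shot and few-shot prompting concrete, the following minimal sketch assembles a prompt from an instruction, optional in-context examples, the dialogue history, and the current question. The prompt layout, the `Question:`/`SPARQL:` labels, and the function name are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    history: List[str],
    question: str,
) -> str:
    """Build a text prompt; an empty `examples` list yields a zero-shot prompt."""
    blocks = [instruction.strip()]
    # Few-shot demonstrations: (question, SPARQL query) pairs shown before the real input.
    for example_question, example_query in examples:
        blocks.append(f"Question: {example_question}\nSPARQL: {example_query}")
    # Earlier dialogue turns give the model the context needed to resolve references.
    if history:
        blocks.append("Conversation so far:\n" + "\n".join(history))
    # The prompt ends with an open slot that the model is expected to complete.
    blocks.append(f"Question: {question}\nSPARQL:")
    return "\n\n".join(blocks)
```

Passing an empty list for `examples` gives the zero-shot variant, so the same function covers both prompting settings compared in the experiments.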

Paper Structure (10 sections, 4 tables)

INTRODUCTION
RELATED WORK
EXPERIMENTAL SETUP
Benchmark Dataset
Models
RESULTS AND DISCUSSION
Automatic Evaluation Results
Human Evaluation Results
Discussion
Conclusion
