Assessing LLMs Suitability for Knowledge Graph Completion
Vasile Ionut Remus Iga, Gheorghe Cosmin Silaghi
TL;DR
This work evaluates the suitability of Large Language Models for Knowledge Graph Completion in static knowledge graphs embedded within Task-Oriented Dialogue systems. It compares Mixtral-8x7b-Instruct-v0.1, GPT-3.5-Turbo-0125, and GPT-4o using TELeR-based prompting across Zero- and One-Shot settings on two domain-tailored datasets, with both strict and flexible evaluation metrics. The contributions include two personalized KGC datasets, a structured prompting framework, and a flexible post-processing metric scheme. The results show that GPT-4o is the most reliable across settings, while Mixtral struggles with strict formats. The findings inform practical integration of KG completion in ontology-enhanced TOD systems and point toward retrieval-augmented generation and careful prompting as promising directions for robust, cost-aware deployment.
Abstract
Recent work has shown that Large Language Models (LLMs) can solve tasks related to Knowledge Graphs, such as Knowledge Graph Completion, even in Zero- or Few-Shot paradigms. However, they are known to hallucinate answers or to produce non-deterministic output, which leads to wrongly reasoned responses even when those responses satisfy the user's demands. To highlight opportunities and challenges in knowledge graph-related tasks, we experiment with three distinguished LLMs, namely Mixtral-8x7b-Instruct-v0.1, GPT-3.5-Turbo-0125 and GPT-4o, on Knowledge Graph Completion for static knowledge graphs, using prompts constructed following the TELeR taxonomy, in Zero- and One-Shot contexts, on a Task-Oriented Dialogue system use case. Our results, obtained with both strict and flexible evaluation metrics, show that LLMs could be fit for such a task if prompts encapsulate sufficient information and relevant examples.
