MST5 -- Multilingual Question Answering over Knowledge Graphs
Nikit Srivastava, Mengshi Ma, Daniel Vollmers, Hamada Zahera, Diego Moussallem, Axel-Cyrille Ngonga Ngomo
TL;DR
KGQA remains English-centric. MST5 addresses multilingual KGQA by using a single pretrained multilingual transformer (mT5), augmented with linguistic context and entity information, to generate SPARQL queries end-to-end. Formally, given a natural language query $Q$ and auxiliary data $A$, MST5 seeks $\hat{S} = \arg\max_{S'} P(S'\mid Q, A; \theta)$ and is trained by minimizing the loss $\mathcal{L}(\theta) = -\log P(S\mid Q, A; \theta)$, where $S$ is the reference SPARQL query and $\theta$ denotes the model parameters. The approach preprocesses the SPARQL targets, concatenates the auxiliary features with the question, and relies on the transformer's attention to model $Q$ and $A$ jointly. On QALD-9-Plus (updated) and QALD-10, MST5 variants outperform baselines (e.g., DeepPavlov-2023) across languages. The work adds coverage for Chinese and Japanese, and the code is open source to facilitate replication. Limitations include dependencies on third-party tools and lower performance on some low-resource languages; multi-task learning is identified as future work.
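As a rough illustration of the formulation above, the following Python sketch shows one way a question $Q$ and auxiliary data $A$ could be concatenated into a single input string, and how the loss $-\log P(S\mid Q, A; \theta)$ decomposes into a sum over target tokens under the usual autoregressive factorization. The separator token, the feature order, the function names, and the entity placeholder are all assumptions made for this example; they are not taken from the MST5 implementation.

```python
import math
from typing import Sequence

# Hypothetical separator token; the input format actually used by MST5 may differ.
SEP = "<sep>"


def build_input(question: str, pos_tags: Sequence[str], entities: Sequence[str]) -> str:
    """Concatenate the question Q with auxiliary data A into one model input.

    Illustrative only: a single encoder then attends over the question,
    the linguistic tags, and the entity identifiers together.
    """
    parts = [question, " ".join(pos_tags), " ".join(entities)]
    return f" {SEP} ".join(parts)


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Return -log P(S | Q, A) as the sum of per-token negative log-probabilities.

    token_probs[t] stands for the probability the model assigns to the t-th
    reference SPARQL token given Q, A, and the preceding reference tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Toy values: the tags and the entity identifier are placeholders, not real annotations.
text = build_input(
    "Who wrote Dracula?",
    ["PRON", "VERB", "PROPN", "PUNCT"],
    ["ENTITY_ID_PLACEHOLDER"],
)
loss = sequence_nll([0.5, 0.5])
```

In practice the per-token probabilities would come from the decoder of the fine-tuned model rather than being supplied by hand; the sketch only makes the shape of the input and of the objective concrete.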
Abstract
Knowledge Graph Question Answering (KGQA) simplifies querying vast amounts of knowledge stored in a graph-based model using natural language. However, the research has largely concentrated on English, putting non-English speakers at a disadvantage. Meanwhile, existing multilingual KGQA systems face challenges in achieving performance comparable to English systems, highlighting the difficulty of generating SPARQL queries from diverse languages. In this research, we propose a simplified approach to enhance multilingual KGQA systems by incorporating linguistic context and entity information directly into the processing pipeline of a language model. Unlike existing methods that rely on separate encoders for integrating auxiliary information, our strategy leverages a single, pretrained multilingual transformer-based language model to manage both the primary input and the auxiliary data. Our methodology significantly improves the language model's ability to accurately convert a natural language query into a relevant SPARQL query. It demonstrates promising results on the most recent QALD datasets, namely QALD-9-Plus and QALD-10. Furthermore, we introduce and evaluate our approach on Chinese and Japanese, thereby expanding the language diversity of the existing datasets.
