Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models

Xiaolin Xing, Zhiwei He, Haoyu Xu, Xing Wang, Rui Wang, Yu Hong

TL;DR

This work addresses cross-lingual inconsistencies in large language models by constructing MAKQA, a multilingual aligned knowledge QA dataset, and introducing three evaluation axes: Cross-lingual Semantic Consistency (xSC) via LaBSE, Cross-lingual Accuracy Consistency (xAC), and Cross-lingual Timeliness Consistency (xTC). The combined Cross-lingual Consistency (xC) metric provides a holistic view of model performance across language pairs. Empirical results show substantial cross-lingual inconsistency, with GPT-3.5 leading among tested models but still far from an Oracle, and reveal a positive link between cross-lingual consistency and multilingual translation ability. These findings highlight avenues to improve multilingual robustness and interpretability in LLMs, and suggest that enhancing translation capabilities could bolster cross-lingual consistency in knowledge handling.

Abstract

This paper investigates the cross-lingual inconsistencies observed in Large Language Models (LLMs), such as ChatGPT, Llama, and Baichuan, which have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs. To address these questions, we propose an innovative evaluation method for Cross-lingual Semantic Consistency (xSC) using the LaBSE model. We further introduce metrics for Cross-lingual Accuracy Consistency (xAC) and Cross-lingual Timeliness Consistency (xTC) to comprehensively assess the models' performance regarding semantic, accuracy, and timeliness inconsistencies. By harmonizing these metrics, we provide a holistic measurement of LLMs' cross-lingual consistency. Our findings aim to enhance the understanding and improvement of multilingual capabilities and interpretability in LLMs, contributing to the development of more robust and reliable multilingual language models.
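The abstract names the metrics but does not give their formulas, so the following is only a minimal sketch under stated assumptions. It assumes that a semantic consistency score can be approximated as the mean cosine similarity between sentence embeddings of aligned answers in two languages (in the paper these embeddings would come from LaBSE; here they are plain vectors), and that "harmonizing" the three scores means taking their harmonic mean. The function names `cosine`, `semantic_consistency`, and `combined_consistency` are hypothetical and are not taken from the paper.

```python
import math
from statistics import harmonic_mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_consistency(emb_a, emb_b):
    """Mean cosine similarity over aligned answer embeddings.

    emb_a[i] and emb_b[i] are embeddings of the answers to the same
    question in language A and language B (assumed to be precomputed).
    """
    if not emb_a or len(emb_a) != len(emb_b):
        raise ValueError("need two non-empty, aligned embedding lists")
    return sum(cosine(u, v) for u, v in zip(emb_a, emb_b)) / len(emb_a)


def combined_consistency(xsc, xac, xtc):
    """Assumed combination: harmonic mean of three scores in [0, 1]."""
    return harmonic_mean([xsc, xac, xtc])


# Toy 2-d "embeddings": the first pair agrees, the second is orthogonal.
lang_a = [[1.0, 0.0], [0.0, 1.0]]
lang_b = [[1.0, 0.0], [1.0, 0.0]]
xsc = semantic_consistency(lang_a, lang_b)
```

A harmonic mean penalizes a model that scores well on one axis but poorly on another, which is one plausible reason to combine the axes this way; the paper's own definition should be consulted for the exact form.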

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Cross-Lingual Inconsistencies in LLM Responses.
  • Figure 2: LLM performance in multilingual translation and average xSC score distribution.
  • Figure 3: LLM performance in multi-language translation and average xAC score distribution.