MixRED: A Mix-lingual Relation Extraction Dataset

Lingxing Kong, Yougang Chu, Zheng Ma, Jianbing Zhang, Liang He, Jiajun Chen

TL;DR

This work defines MixRE, the task of relation extraction in mix-lingual, code-switching settings, and introduces MixRED, the first human-annotated dataset for this scenario. It proposes a hierarchical mix framework (inter-sentence, intra-sentence, and entity-level) with language concentration control, supplemented by rigorous human annotation to ensure quality. A broad evaluation shows supervised RE models generally outperform LLMs on MixRED, and reveals how mix strategies and language concentration shape performance, guiding future improvements. The authors further demonstrate that mix-lingual exemplars and Mix-lingual Chain-of-Thought prompting can significantly boost LLM performance, offering practical directions for robust multilingual relation extraction.

Abstract

Relation extraction is a critical task in natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix content from different languages within sentences. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce MixRE, a novel task of relation extraction in the mix-lingual scenario, and construct the human-annotated dataset MixRED to support it. We evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we examine factors influencing model performance on the MixRE task and identify promising directions for improving both supervised models and LLMs.

Paper Structure (24 sections, 2 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: A real-world mix-lingual RE instance in both English and Chinese versions. Terms in the same color represent mentions of a specific entity.
  • Figure 2: The construction framework of MixRED. The percentages 30%, 50%, and 70% represent varying concentrations of content converted from English to Chinese during the creation of mix-lingual samples.
  • Figure 3: Distribution of samples in MixRED.
  • Figure 4: Distribution of relational triples for the top 40% relations in MixRED.
  • Figure 5: Development of mix-lingual exemplars and CoT for enhancing LLM performance.
  • ...and 1 more figure
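
The Figure 2 caption describes 30%, 50%, and 70% concentrations of content converted from English to Chinese. As a rough illustration of what concentration control could look like at the inter-sentence level, the sketch below swaps a chosen fraction of English sentences for their Chinese counterparts. It is not the authors' construction pipeline: the function name `mix_inter_sentence`, the sentence-aligned parallel inputs, and measuring concentration by sentence count are all assumptions made for this example.

```python
import random


def mix_inter_sentence(en_sents, zh_sents, concentration, seed=0):
    """Hypothetical helper: build a mix-lingual document from parallel sentences.

    Assumes en_sents[i] and zh_sents[i] are translations of each other, and
    that "concentration" means the fraction of sentences shown in Chinese.
    """
    if len(en_sents) != len(zh_sents):
        raise ValueError("parallel sentence lists must have equal length")
    if not 0.0 <= concentration <= 1.0:
        raise ValueError("concentration must be between 0 and 1")

    n = len(en_sents)
    k = round(n * concentration)  # number of sentences to convert
    rng = random.Random(seed)     # fixed seed keeps the sample reproducible
    chosen = set(rng.sample(range(n), k))

    # Keep the original sentence order; only the language of chosen slots changes.
    mixed = [zh_sents[i] if i in chosen else en_sents[i] for i in range(n)]
    return mixed, sorted(chosen)


if __name__ == "__main__":
    en = [f"English sentence {i}." for i in range(10)]
    zh = [f"中文句子 {i}。" for i in range(10)]
    for level in (0.3, 0.5, 0.7):
        mixed, chosen = mix_inter_sentence(en, zh, level)
        print(level, chosen)
```

Intra-sentence and entity-level mixing would need phrase or entity alignments instead of sentence alignments, which this sketch does not attempt.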