MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

Chengguang Gan, Sunbowen Lee, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori

TL;DR

The paper addresses the limitation of MRE-focused information extraction data by creating MMM, a multilingual 21-subdataset collection across English, Japanese, and Chinese, and by introducing an LLM-assisted translation pipeline. It constructs a new open-domain NER dataset (TCONER) and trains OIELLM, a unified model that outputs both text-level labels and word-level label-entities, achieving competitive gains across MMM tasks. The work demonstrates that multilingual, multitask pretraining and carefully crafted input-output formats can enhance open-domain IE performance, enabling broader cross-lingual research and practical applications. By open-sourcing MMM and OIELLM, the authors provide a valuable resource to accelerate multilingual IE research and reduce annotation overhead, especially for underrepresented languages.

Abstract

The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained because MRE mix datasets have been available only in Japanese, limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets. Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Using this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). OIELLM can effectively process the new MMM datasets and exhibits significant improvements in performance. The OIELLM model and datasets are open-sourced on HuggingFace: https://ganchengguang.github.io/MRE/

Paper Structure (16 sections, 3 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 16 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: The Mutual Reinforcement Effect between the word-level labels and the text-level label within the same text. A word-level IE task is a point, and a text-level IE task is a line. There is a Mutual Reinforcement Effect between the point and the line.
  • Figure 2: Names of all sub-datasets in the Multilingual Mutual Reinforcement Effect Mix Datasets. (The image does not represent the actual proportions of the sub-dataset sizes.)
  • Figure 3: The format of MMM datasets.
  • Figure 4: Overview of the dataset translation framework.
  • Figure 5: The input and output of the Open-domain Information Extraction Large Language Model (OIELLM).
  • ...and 3 more figures