UMIE: Unified Multimodal Information Extraction with Instruction Tuning

Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou

TL;DR

UMIE tackles fragmentation in multimodal information extraction (MIE) by unifying multimodal named entity recognition (MNER), multimodal relation extraction (MRE), and multimodal event extraction (MEE) into a single generation framework guided by task instructions. It uses a four-component architecture, initialized from FLAN-T5 and including a visual encoder and a gated cross-attention module, to fuse text and image information into structured outputs. Across six datasets and three tasks, UMIE achieves state-of-the-art results and demonstrates strong zero-shot generalization and robustness to instruction variants, underscoring its potential as a foundation model for MIE. The work also releases its datasets, code, and models, facilitating future research into instruction-tuned multimodal IE.

Abstract

Multimodal information extraction (MIE) has gained significant attention as multimedia content has grown in popularity. However, current MIE methods often rely on task-specific model structures, which limits generalizability across tasks and underutilizes the knowledge shared among MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that casts three MIE tasks as a generation problem using instruction tuning and can effectively extract both textual and visual mentions. Extensive experiments show that a single UMIE model outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration of both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE
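
To make the idea of casting three tasks as one instruction-guided generation problem concrete, the following is a minimal sketch of how a text-to-text source string might be assembled per task. The instruction wording, the `Example` class, the `build_source` helper, and the "Mentions:" / "Text:" field markers are all hypothetical illustrations and are not taken from the UMIE paper or repository; image features are omitted entirely.

```python
from dataclasses import dataclass

# Hypothetical instruction templates (assumption): the wording actually used
# by UMIE is not reproduced here.
INSTRUCTIONS = {
    "MNER": "Extract the named entities and their types.",
    "MRE": "Identify the relation between the two given mentions.",
    "MEE": "Extract the event trigger, its type, and its arguments.",
}


@dataclass
class Example:
    """One training or inference example (hypothetical container)."""
    task: str             # one of the keys of INSTRUCTIONS
    text: str             # the input sentence
    mentions: tuple = ()  # given mentions, only used for MRE in this sketch


def build_source(example: Example) -> str:
    """Prefix the input text with a task instruction so that one
    text-to-text model can be steered to any of the three tasks."""
    if example.task not in INSTRUCTIONS:
        raise ValueError(f"unknown task: {example.task}")
    parts = [INSTRUCTIONS[example.task]]
    if example.mentions:
        parts.append("Mentions: " + " ; ".join(example.mentions))
    parts.append("Text: " + example.text)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_source(Example("MNER", "Kobe waves to the crowd.")))
    print(build_source(Example("MRE", "Kobe plays for the Lakers.",
                               ("Kobe", "Lakers"))))
```

Because only the instruction prefix changes between tasks, the same encoder-decoder weights can be shared across MNER, MRE, and MEE, which is the property the abstract attributes to the unified formulation.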

Paper Structure (16 sections, 8 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Unifying three key MIE tasks in a single multimodal model. Given a task instructor, UMIE performs the corresponding task by extracting textual and visual mentions (MNER and MEE) or inferring the relationship between two given mentions (MRE). $O_1$, $O_2$, and $O_3$ are visual objects.
  • Figure 2: Illustration of the UMIE model. The visual encoder encodes an image and objects into features that are dynamically integrated with textual features in the gated attention module and the text decoder generates information extraction results autoregressively.
  • Figure 3: Illustration of the gated attention module.
  • Figure 4: (a) Performance of UMIE-Base, averaged over the three MIE tasks, with respect to a fixed gate value $g$, where D denotes the dynamic gate value used by UMIE; (b) distribution of the $g$ values produced by the dynamic gate module of UMIE-Base on the three MIE tasks.
  • Figure 5: Performance averaged over the three MIE tasks with respect to the training sampling ratios of the Twitter and News corpora.
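
The captions of Figures 2 to 4 describe a gated attention module that integrates visual features with textual features through a gate value $g$, which is either fixed or computed dynamically. The sketch below illustrates one common way such a gate can work; it is not the paper's formulation. The single-head attention, the sigmoid gate over the concatenated vectors, the residual form `text + g * attended`, and every function and parameter name are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(text_vec, visual_vecs):
    """Single-head scaled dot-product attention: one text vector acts as
    the query, the visual vectors act as both keys and values."""
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, v) / scale for v in visual_vecs])
    dim = len(visual_vecs[0])
    return [sum(w * v[i] for w, v in zip(weights, visual_vecs))
            for i in range(dim)]


def gated_fuse(text_vec, visual_vecs, gate_weights, fixed_gate=None):
    """Fuse visual context into a text vector through a scalar gate g.

    Assumed form (not from the paper): fused = text + g * attended.
    With fixed_gate=None, g is a sigmoid over a linear score of the
    concatenation [text; attended]; otherwise g is the given constant.
    """
    attended = cross_attend(text_vec, visual_vecs)
    if fixed_gate is None:
        score = dot(gate_weights, text_vec + attended)  # list concatenation
        g = 1.0 / (1.0 + math.exp(-score))
    else:
        g = fixed_gate
    fused = [t + g * a for t, a in zip(text_vec, attended)]
    return fused, g


if __name__ == "__main__":
    text = [1.0, 0.0]
    visual = [[0.5, 0.5], [1.0, -1.0]]  # e.g. image and object features
    print(gated_fuse(text, visual, gate_weights=[0.1, 0.2, 0.3, 0.4]))
    print(gated_fuse(text, visual, gate_weights=[0.0] * 4, fixed_gate=0.0))
```

Under this assumed form, a fixed gate of 0 reduces the module to a text-only model, while a dynamic gate lets each example decide how much visual evidence to admit, which is the contrast that Figure 4 explores.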