UMIE: Unified Multimodal Information Extraction with Instruction Tuning

Lin Sun; Kai Zhang; Qingyuan Li; Renze Lou

TL;DR

UMIE tackles fragmentation in multimodal information extraction (MIE) by unifying multimodal named entity recognition (MNER), multimodal relation extraction (MRE), and multimodal event extraction (MEE) into a single generation framework guided by task instructions. It uses a four-component architecture, including a visual encoder and a gated cross-attention module, initialized from FLAN-T5, to fuse text and image information into structured outputs. Across six datasets and three tasks, UMIE achieves state-of-the-art results and demonstrates strong zero-shot generalization and robustness to instruction variants, underscoring its potential as a foundation model for MIE. The work also releases its datasets, code, and models, facilitating future research into instruction-tuned multimodal IE.

Abstract

Multimodal information extraction (MIE) has gained significant attention as the popularity of multimedia content increases. However, current MIE methods often rely on task-specific model structures, which limits generalizability across tasks and underutilizes the knowledge shared among MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that casts three MIE tasks as a generation problem using instruction tuning and can effectively extract both textual and visual mentions. Extensive experiments show that a single UMIE model outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration of both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE
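
To make the idea of casting three MIE tasks as one instruction-driven generation problem concrete, the following is a minimal sketch of how a task instruction might be prepended to the input text before it reaches a text-to-text model. The instruction wording, the `Example` class, and the `build_input` helper are all hypothetical and are not taken from the UMIE paper or repository; the image branch is omitted.

```python
from dataclasses import dataclass

# Hypothetical instruction templates: the wording is illustrative only and
# is NOT the phrasing used by the UMIE authors.
TASK_INSTRUCTIONS = {
    "MNER": "Extract the named entities and their types from the text.",
    "MRE": "Identify the relation between the two given mentions.",
    "MEE": "Extract the event triggers and their arguments from the text.",
}


@dataclass
class Example:
    task: str  # one of "MNER", "MRE", "MEE"
    text: str  # the textual part of the multimodal input


def build_input(example: Example) -> str:
    """Prefix the text with the instruction for its task.

    A single generative model can then be trained on all three tasks,
    because the instruction tells it which output format to produce.
    """
    if example.task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {example.task}")
    return f"{TASK_INSTRUCTIONS[example.task]} Text: {example.text}"


if __name__ == "__main__":
    print(build_input(Example("MNER", "Kai visited Paris last week.")))
```

Under this assumed scheme, switching tasks only changes the instruction prefix, which is what lets one set of weights serve MNER, MRE, and MEE.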

Paper Structure (16 sections, 8 equations, 5 figures, 7 tables)

Introduction
Related Work
Unified Multimodal Information Extractor
Model Overview
Visual Encoder
Gated Attention Module
Task Instructor and Text Decoding
Experiments
Experiment Setup
Main Results
Zero-shot Generalization
Robustness to Instruction-Following
Gate Control Ablation
Training Materials
Conclusion
...and 1 more section

Figures (5)

Figure 1: Unifying three key MIE tasks in a single multimodal model. Given a task instructor, UMIE performs the corresponding task by extracting textual and visual mentions (MNER and MEE) or inferring the relationship between two given mentions (MRE). $O_1$, $O_2$, and $O_3$ are visual objects.
Figure 2: Illustration of the UMIE model. The visual encoder encodes an image and objects into features that are dynamically integrated with textual features in the gated attention module and the text decoder generates information extraction results autoregressively.
Figure 3: Illustration of the gated attention module.
Figure 4: (a) Performance of UMIE-Base averaged over three MIE tasks w.r.t. the fixed gate value $g$, where D denotes the dynamic gate value of UMIE; (b) distribution of $g$ values produced by the dynamic gate module of UMIE-Base on the three MIE tasks.
Figure 5: Average performance on three MIE tasks w.r.t. the training sampling ratios of the Twitter and News corpora.
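
The captions for Figures 2 to 4 describe a gate value $g$ that controls how much visual information is mixed into the textual features, either fixed or computed dynamically. The exact formulation is not given on this page, so the sketch below assumes a simple form: a scalar sigmoid gate and a convex combination of the two feature vectors. The functions `dynamic_gate` and `fuse`, and the `weights` and `bias` parameters, are hypothetical names used for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dynamic_gate(text_vec, visual_vec, weights, bias=0.0):
    """Compute a scalar gate g in (0, 1) from both feature vectors.

    ASSUMPTION: the gate is a sigmoid over a linear score of the
    concatenated features; `weights` stands in for learned parameters.
    """
    features = list(text_vec) + list(visual_vec)
    if len(weights) != len(features):
        raise ValueError("weights must match the concatenated feature length")
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(score)


def fuse(text_vec, visual_vec, g):
    """Mix textual and visual features with gate value g.

    ASSUMPTION: a convex combination; g = 0 keeps only the text features
    and g = 1 keeps only the visual features.
    """
    if len(text_vec) != len(visual_vec):
        raise ValueError("feature vectors must have the same length")
    return [(1.0 - g) * t + g * v for t, v in zip(text_vec, visual_vec)]


if __name__ == "__main__":
    text, visual = [1.0, 2.0], [3.0, 4.0]
    g = dynamic_gate(text, visual, weights=[0.1, -0.2, 0.3, 0.05])
    print(g, fuse(text, visual, g))
```

Passing a constant `g` to `fuse` corresponds to the fixed-gate setting in Figure 4(a), while calling `dynamic_gate` first corresponds to the dynamic setting labelled D, again under the assumptions stated above.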
