Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction

Meishan Zhang; Hao Fei; Bin Wang; Shengqiong Wu; Yixin Cao; Fei Li; Min Zhang

TL;DR

Grounded Multimodal Universal Information Extraction (MUIE) addresses the fragmentation of IE tasks by proposing a unified framework that outputs textual IE labels and fine-grained grounding across modalities. The authors introduce Reamo, a Vicuna-based multimodal LLM that leverages ImageBind for encoding and SEEM/SHAS-based grounding, and train it via UIE instruction tuning, multimodal alignment learning, grounding-aware tuning, and invocation-based meta-response tuning. They also curate a high-quality benchmark with 3,000 test instances across nine modality combinations for NER, RE, and EE, enabling comprehensive zero-shot evaluation. Experiments show Reamo outperforms existing MLLMs in end-to-end UIE and grounding across image, audio, video, and complex modality mixtures, establishing a strong early benchmark for grounded MUIE. Overall, the work advances practical, unified multimodal IE and provides publicly released resources to spur future research.

Abstract

In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework for analyzing any IE task over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), Reamo, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. Reamo is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. Extensive comparison of Reamo with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research. Our resources are publicly released at https://haofei.vip/MUIE.
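To make the idea of "IE labels plus fine-grained grounding" concrete, the following is a minimal sketch of how a grounded MUIE prediction might be represented. It is an illustration only, not the paper's actual output format: the class names, field names, and the use of a bounding box and a time span as grounding are all assumptions introduced here (the paper's grounding may use finer units such as segmentation masks). Event extraction is omitted and would need a similar record.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Grounding:
    """Hypothetical anchor of a mention in a non-text input."""
    modality: str                                     # "image", "video", or "audio"
    box: Optional[Tuple[int, int, int, int]] = None   # assumed pixel region (x0, y0, x1, y1)
    span_sec: Optional[Tuple[float, float]] = None    # assumed (start, end) in seconds


@dataclass
class Entity:
    text: str
    label: str
    grounding: Optional[Grounding] = None  # None when nothing outside the text supports it


@dataclass
class Relation:
    head: Entity
    tail: Entity
    label: str


@dataclass
class MUIEOutput:
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def grounded_entities(self) -> List[Entity]:
        """Return only the entities that carry a grounding."""
        return [e for e in self.entities if e.grounding is not None]


# Made-up example: a person visible in an image, and an organization
# mentioned only in the accompanying text.
person = Entity("the speaker", "PER", Grounding("image", box=(40, 20, 180, 300)))
org = Entity("the university", "ORG")
out = MUIEOutput(entities=[person, org],
                 relations=[Relation(person, org, "works_for")])
```

Keeping the grounding optional reflects the distinction the paper draws between information shared across modalities and information present in only one of them; how Reamo actually encodes that distinction is not specified in this summary.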

Paper Structure (43 sections, 2 equations, 7 figures, 6 tables)

Introduction
Related Works
Task Definition: Grounded Multimodal Universal Information Extraction
Our Proposed Model
MLLM Framework of Reamo
Multimodal Encoding.
LLM Reasoner.
MUIE Decoding with Grounding.
MUIE Fine-tuning for Reamo
UIE Instruction Tuning.
Multimodal Alignment Learning.
Fine-grained Cross-modal Grounding-aware Tuning.
Invocation-based Meta-response Tuning.
A Benchmark for Grounded MUIE
Data Source
...and 28 more sections

Figures (7)

Figure 1: Examples of grounded multimodal universal information extraction (MUIE).
Figure 2: An overview of the proposed Reamo MLLM architecture for grounded MUIE.
Figure 3: Performance gap between modality-shared (aligned) and modality-specific (unaligned) MUIE.
Figure 4: Impact of different object/entity numbers.
Figure 5: Qualitative result A on MUIE (NER) with modality-specific case via reasoning.
...and 2 more figures
