RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

TL;DR

RMPL is a Relation-aware Multi-task Progressive Learning framework for Multimedia Event Extraction (MEE) under low-resource conditions. It uses stage-wise training to incorporate heterogeneous supervision from unimodal event extraction and multimedia relation extraction.

Abstract

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

Paper Structure (24 sections, 7 equations, 3 figures, 4 tables)

This paper contains 24 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An example of MEE from M2E2 (Li et al., 2020). The Justice: Arrest-Jail event is recognized from multimedia event content with paired text and image inputs, triggered by "detained", and event arguments are classified across modalities.
  • Figure 2: Overview of RMPL. Stage I conducts unified schema warm-up training with heterogeneous event-centric supervision to learn general event representations. Stage II further performs task-specific supervised training for event mention identification and argument role extraction across modalities, and finally evaluates the resulting models on the M2E2 benchmark.
  • Figure 3: Impact of different supervision mixing proportions.