ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei Li

TL;DR

ECHO (Event-Centric Hypergraph Operations) is a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), an explicit intermediate structure for multimodal event hypotheses. It also introduces a Link-then-Bind strategy that enforces deferred commitment.

Abstract

Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by having them apply atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state of the art (SOTA): with Qwen3-32B, it achieves improvements of 7.3% and 15.5% in average event mention and argument role F1, respectively.
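The Link-then-Bind idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class names `MEHG` and `EventHyperedge`, the operation names `add_event`, `link`, and `bind`, the vertex identifiers, and the event type and role labels are all hypothetical. The sketch only shows the deferred-commitment constraint, namely that a role can be bound to a vertex only after that vertex has been linked to the event hyperedge, and that every applied operation is recorded in a log.

```python
from dataclasses import dataclass, field


@dataclass
class EventHyperedge:
    """One event hypothesis: a hyperedge over text and image vertices."""
    event_type: str
    linked: set = field(default_factory=set)   # vertices judged relevant
    roles: dict = field(default_factory=dict)  # vertex -> role, filled in later


@dataclass
class MEHG:
    """Hypothetical, minimal stand-in for a Multimedia Event Hypergraph."""
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # edge id -> EventHyperedge
    log: list = field(default_factory=list)    # record of applied operations

    def add_event(self, edge_id, event_type):
        self.edges[edge_id] = EventHyperedge(event_type)
        self.log.append(("add_event", edge_id, event_type))

    def link(self, edge_id, vertex):
        # Link step: mark a vertex as relevant without committing to a role.
        if vertex not in self.vertices:
            raise KeyError(f"unknown vertex: {vertex}")
        self.edges[edge_id].linked.add(vertex)
        self.log.append(("link", edge_id, vertex))

    def bind(self, edge_id, vertex, role):
        # Bind step: a role may only be assigned to an already linked vertex.
        edge = self.edges[edge_id]
        if vertex not in edge.linked:
            raise ValueError("bind requires a prior link")
        edge.roles[vertex] = role
        self.log.append(("bind", edge_id, vertex, role))


# Illustrative usage with made-up vertices and labels.
graph = MEHG(vertices={"text:protesters", "img:box_0"})
graph.add_event("e1", "Conflict.Demonstrate")
graph.link("e1", "text:protesters")
graph.link("e1", "img:box_0")
graph.bind("e1", "text:protesters", "Entity")
```

In this sketch the image region `img:box_0` stays linked but unbound, which is the point of deferring commitment: relevance is recorded first, and the role decision can be made later or never.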

Paper Structure (62 sections, 3 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Argument role F1 on the M2E2 benchmark [WASE] using direct prompting. Text-only LLMs receive visual descriptions generated by Qwen3-VL-8B-Thinking as additional input, while LVLMs process images directly. The SOTA model X-MTL and our proposed ECHO are included for reference.
  • Figure 2: Overview of ECHO. Blue vertices are text entity mentions (surface spans in $T$) and green vertices are image object regions (bounding boxes in $I$); large circles are event hyperedges. Given $D=(T,I)$, Stage I constructs the vertex inventory and initializes an edge-free MEHG; Stage II agents negotiate MEHG updates via auditable atomic operations; Stage III performs role binding and consolidation/normalization to produce schema-consistent event predictions.
  • Figure 3: F1 comparison of direct prompting, a MetaGPT-style multi-agent baseline, and ECHO on M2E2 across textual, visual, and multimedia settings.
  • Figure 4: Ablation on the multimedia setting of M2E2. We report F1 for event mention (EM) and argument role (AR) with three backbones.
  • Figure 5: Negotiation budget analysis for Stage II under early stopping. (a) Proportion of samples that stop by round $k$ (i.e., $T_{\mathrm{used}} \le k$) when varying the maximum budget $T_{\max}$. (b) Distribution of the number of committed atomic operations per sample. (c) Sensitivity of extraction performance to $T_{\max}$, reported as $\Delta$F1 (percentage points) relative to $T_{\max}{=}2$.
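The quantity plotted in Figure 5(a), the proportion of samples with $T_{\mathrm{used}} \le k$, is a cumulative fraction over per-sample stopping rounds. A minimal sketch of that computation follows; the function name `stop_by_round` and the input values are made up for illustration and are not taken from the paper.

```python
def stop_by_round(t_used, t_max):
    """Return the fraction of samples with T_used <= k, for k = 1..t_max."""
    n = len(t_used)
    return [sum(t <= k for t in t_used) / n for k in range(1, t_max + 1)]


# Hypothetical stopping rounds for five samples under a budget of T_max = 4.
rounds = [1, 2, 2, 3, 4]
print(stop_by_round(rounds, 4))  # [0.2, 0.6, 0.8, 1.0]
```

The resulting list is non-decreasing and reaches 1.0 at $k = T_{\max}$ whenever every sample stops within the budget.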