Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles

Abhijnan Nath, Huma Jamil, Shafiuddin Rehan Ahmed, George Baker, Rahul Ghosh, James H. Martin, Nathaniel Blanchard, Nikhil Krishnaswamy

TL;DR

The paper tackles cross-document event coreference resolution by incorporating multimodal cues through a low-compute bidirectional linear transfer between vision and language embeddings (Lin-Sem). It augments ECB+ with event-centric images (including diffusion-generated ones) and applies a three-pronged approach: a fusion baseline, a linear-mapping method, and an ensemble that partitions mention pairs by semantic and discourse-level difficulty. Empirical results on ECB+ and AIDA Phase 1 show that Lin-Sem-based ensembles can exceed text-only baselines (e.g., up to CoNLL F1 = 91.9 on ECB+) while remaining computationally efficient, highlighting the value of visual grounding for hard coreference cases. The work argues for more multimodal resources in coreference tasks and suggests future multilingual extensions and theoretical analyses of embedding-space mappings to support broader applicability and guarantees.

Abstract

Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can help facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on 2 datasets: the augmented ECB+, and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.
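
The "simple linear map between vision and language models" mentioned in the abstract can be illustrated with a small sketch. This section does not specify how the map is fit, so the closed-form (ridge) least-squares solution below is an assumption chosen for brevity, not the authors' procedure. The embeddings are synthetic, and `fit_linear_map`, `img_emb`, `txt_emb`, and the dimensions are hypothetical names and values.

```python
import numpy as np


def fit_linear_map(src, tgt, ridge=0.0):
    """Return W (d_src x d_tgt) minimizing ||src @ W - tgt||^2 + ridge * ||W||^2.

    Assumption: a closed-form least-squares fit stands in for whatever
    training procedure the paper actually uses.
    """
    d_src = src.shape[1]
    gram = src.T @ src + ridge * np.eye(d_src)
    return np.linalg.solve(gram, src.T @ tgt)


# Synthetic stand-ins for frozen encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_pairs, d_img, d_txt = 200, 16, 8
img_emb = rng.normal(size=(n_pairs, d_img))
true_map = rng.normal(size=(d_img, d_txt))
txt_emb = img_emb @ true_map  # text embeddings exactly linear in image ones

# "Bidirectional" here means one map per direction.
W_img2txt = fit_linear_map(img_emb, txt_emb)
W_txt2img = fit_linear_map(txt_emb, img_emb, ridge=1e-3)

# Project image embeddings into the text space and measure the worst error.
mapped = img_emb @ W_img2txt
err = float(np.max(np.abs(mapped - txt_emb)))
```

Because only a single matrix per direction is estimated and the encoders stay untouched, a fit like this is cheap compared with finetuning a fused model, which is the efficiency argument the TL;DR makes for Lin-Sem.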

Paper Structure (31 sections, 8 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Our approach for Multimodal CDCR using Lin-Sem. Linear Mapping (Lin-Sem) procedure between the distinct text and image embedding spaces for an event pair in the ECB+ corpus. Arg1 and Arg2 refer to the individual images in the pair and the trigger events (in yellow) surrounded by the <m> and </m> special tokens embedded in the text encoder (LLM).
  • Figure 2: Pairwise encoding time in GPU seconds (log scale on the y-axis) for text (Longformer), vision (ViT), and fused models vs. Bidirectional Linear Mapping (Lin-Sem) as a function of the number of train pairs in ECB+.
  • Figure 3: Kernel Density Estimation plots of semantic-discourse similarity scores (including Wu-Palmer similarity) for mention pair difficulty categories in ECB+ (left) and AIDA Phase 1 (right), showing a clear demarcation of easy and hard pairs in positive and negative labels. The similarity scores of easy_pos and hard_neg pairs are distributed toward high values, while those of easy_neg and hard_pos pairs are distributed toward lower values.
  • Figure 4: Sample coreferent event pairs from ECB+ that were correctly linked by our best multimodal ensemble ($\texttt{ViT}$-real → $\texttt{LLM}$ + $\texttt{LLM}$ → $\texttt{BEiT}$-real + $\texttt{LLM}$), but not by the text-only model. Event triggers are highlighted in yellow, and text in italics illustrates lexical ambiguity or misleading lexical overlap.
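
The Wu-Palmer similarity named in the Figure 3 caption is a standard taxonomy-based measure: twice the depth of the deepest common ancestor of two concepts, divided by the sum of their depths. A minimal sketch follows, assuming a hypothetical toy taxonomy (`PARENT`) in place of whatever lexical resource the paper uses, and counting the root as depth 1.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "event": None,
    "attack": "event",
    "shooting": "attack",
    "bombing": "attack",
    "election": "event",
}


def path_to_root(node):
    """Return the chain of nodes from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest
    # common ancestor (the least common subsumer).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, `wu_palmer("shooting", "bombing")` is higher than `wu_palmer("shooting", "election")` because the first pair shares the deeper ancestor `attack`. A lexically similar but non-coreferent pair would score high in the same way, which is the kind of pair the caption labels hard_neg.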