Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu

TL;DR

**CoE** is a training-free MMS framework that performs structured reasoning through a **Chain-of-Events**. A Hierarchical Event Graph (HEG) guides this reasoning and scaffolds cross-modal grounding and temporal reasoning.

Abstract

Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.

Paper Structure (55 sections, 1 equation, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Motivating Experiments. Existing MMS models (e.g., MLASK [krubinski2023mlask] and MMSum [qiu2024mmsum]) achieve strong in-domain results when trained on VIEWS [ayyubi2024views], but their performance drops sharply under domain shift. In contrast, our training-free CoE framework generalizes effectively across diverse datasets, maintaining stable zero-shot performance without task-specific training or adaptation.
  • Figure 2: Framework. CoE is a training-free CoT framework for MMS with text-only output. Given a video-text pair $(\bm{V}, \bm{T})$, Module 1 constructs a Hierarchical Event Graph (HEG) that organizes global events, sub-events, and entity-relation structures. Guided by the HEG, Module 2 grounds video clips to sub-events and their visual entity-relation graphs. Module 3 models event evolution by aggregating temporally coherent clips, and finally, Module 4 generates a domain-adaptive summary $\hat{s}$ from the resulting event trajectories.
  • Figure 3: Video Clip Aggregation. CoE merges adjacent clips grounded to the same sub-event and sharing identical entity-relation graphs into longer temporal segments. Red circles indicate new entities or sub-event changes that trigger a new segment.
  • Figure 4: Backbone Generalization. Performance across different backbones in terms of ROUGE score on eight datasets. For each backbone, we report the vanilla model and its CoE-based variant, denoted as CoE( · ). The consistent gains of CoE( · ) over the corresponding vanilla backbones demonstrate the strong generalization ability and robustness of our approach across diverse architectures.
  • Figure 5: Effect of Model Size. Performance of CoE with different MLLM backbones. We report results for Qwen2.5-VL models of increasing size (3B, 7B, 32B) and a proprietary GPT-5 model, showing a clear performance gain as model capacity grows across the four evaluation datasets.
  • ...and 14 more figures
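
The clip aggregation rule described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` and `Segment` types, the `aggregate_clips` function, and the use of exact equality on sub-event labels and relation sets are all assumptions made for illustration. The caption only states that adjacent clips with the same sub-event and identical entity-relation graphs are merged, and that a new entity or a sub-event change starts a new segment.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical representation: a relation is a (subject, predicate, object) triple.
Relation = Tuple[str, str, str]


@dataclass(frozen=True)
class Clip:
    start: float                 # clip start time in seconds
    end: float                   # clip end time in seconds
    sub_event: str               # label of the sub-event the clip is grounded to
    graph: FrozenSet[Relation]   # entity-relation graph observed in the clip


@dataclass
class Segment:
    start: float
    end: float
    sub_event: str
    graph: FrozenSet[Relation]


def aggregate_clips(clips: List[Clip]) -> List[Segment]:
    """Merge temporally adjacent clips that share a sub-event and a graph."""
    segments: List[Segment] = []
    for clip in sorted(clips, key=lambda c: c.start):
        last = segments[-1] if segments else None
        same = (
            last is not None
            and last.sub_event == clip.sub_event
            and last.graph == clip.graph
        )
        if same:
            # Same sub-event and identical graph: extend the current segment.
            last.end = max(last.end, clip.end)
        else:
            # A sub-event change or a changed graph triggers a new segment.
            segments.append(Segment(clip.start, clip.end, clip.sub_event, clip.graph))
    return segments


# Toy input (made up for illustration, not taken from the paper's data).
chop = frozenset({("chef", "chops", "onion")})
chop_pan = frozenset({("chef", "chops", "onion"), ("chef", "holds", "pan")})
clips = [
    Clip(0.0, 5.0, "prepare ingredients", chop),
    Clip(5.0, 10.0, "prepare ingredients", chop),
    Clip(10.0, 15.0, "prepare ingredients", chop_pan),  # new entity appears
    Clip(15.0, 20.0, "cook dish", chop_pan),            # sub-event changes
]
segments = aggregate_clips(clips)
for seg in segments:
    print(seg.start, seg.end, seg.sub_event)
```

In this toy input the first two clips collapse into one segment, while the appearance of a new entity and the later sub-event change each open a new one. A real system would likely need a softer graph comparison than exact set equality, since the graphs are produced by a model.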