Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Xiaochen Wang; Heming Xia; Jialin Song; Longyu Guan; Yixin Yang; Qingxiu Dong; Weiyao Luo; Yifan Pu; Yiru Wang; Xiangdi Meng; Wenjie Li; Zhifang Sui

TL;DR

This work introduces StripCipher, a benchmark designed to evaluate large multimodal models (LMMs) on sequential visual narratives in silent comic strips, addressing the gap left by single-image benchmarks. It defines three tasks of increasing complexity (contextual frame prediction, visual narrative comprehension, and temporal narrative reordering) and constructs a quality-controlled dataset via a human–AI annotation pipeline. An extensive evaluation of 16 LMMs, including GPT-4o and Qwen2.5VL, reveals a substantial gap between models and humans, most pronounced in reordering, where the top model scores around 24% while humans perform significantly better. The analysis shows that input format, model size, and fine-tuning influence performance, but robust temporal reasoning remains a core challenge, underscoring the need for advances in temporal-visual understanding in LMMs.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate the capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07 percentage points lower than human performance. Further quantitative analyses discuss several factors, such as the input format of images, that affect the performance of LMMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.
