Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui

TL;DR

This work introduces StripCipher, a benchmark that evaluates large multimodal models (LMMs) on sequential visual narratives in silent comic strips, addressing the gap left by single-image benchmarks. It defines three tasks of increasing complexity (contextual frame prediction, visual narrative comprehension, and temporal narrative reordering) and constructs a quality-controlled dataset through a human–AI annotation pipeline. An evaluation of 16 LMMs, including GPT-4o and Qwen2.5VL, reveals a substantial gap between models and humans. The gap is most pronounced in reordering, where the top model scores around 24%, roughly 56 percentage points below human performance. The analysis shows that input format, model size, and fine-tuning all influence performance, but robust temporal reasoning remains a core challenge, underscoring the need for advances in temporal-visual understanding in LMMs.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate the capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07 percentage points lower than human performance. Further quantitative analysis discusses several factors, such as the input format of images, that affect the performance of LMMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.
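
The reordering figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes the reported gap of 56.07 is an absolute difference in accuracy (percentage points) rather than a relative drop; the function names are illustrative and not taken from the paper.

```python
def implied_reference_score(model_score: float, gap_points: float) -> float:
    """Reference (here: human) accuracy implied by a model score and an
    absolute gap, both expressed in percentage points."""
    return model_score + gap_points


def relative_shortfall(model_score: float, reference_score: float) -> float:
    """Model shortfall expressed as a percentage of the reference score."""
    return 100.0 * (reference_score - model_score) / reference_score


# Values quoted in the abstract for the reordering subtask.
gpt4o_reordering = 23.93  # accuracy in percent
gap_points = 56.07        # assumed to be an absolute gap in percentage points

human_reordering = implied_reference_score(gpt4o_reordering, gap_points)
shortfall = relative_shortfall(gpt4o_reordering, human_reordering)

print(f"Implied human accuracy: {human_reordering:.2f}%")
print(f"Relative shortfall: {shortfall:.1f}% of human accuracy")
```

Under this reading, the implied human accuracy on reordering is 80%, and the model falls short of it by about 70% in relative terms, which is why the two ways of stating the gap should not be used interchangeably.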

Paper Structure

This paper contains 38 sections, 11 figures, 9 tables.

Figures (11)

  • Figure 1: An example of three tasks: prediction, comprehension, and reordering from the StripCipher dataset. All tasks are presented as multiple-choice questions, with distractors excluded due to limited context.
  • Figure 2: Schematic diagram of the StripCipher dataset construction process, which has three stages: Image Collection, Data Annotation, and Cross Check. Only the comprehension task is displayed, as the prediction task follows the same process.
  • Figure 3: Sample outputs for our three tasks generated by different vision-language models, along with the gold answers. We highlight errors in distractors.
  • Figure 4: Comparison of the accuracy results between Qwen2.5-3B vs Qwen2.5-7B and LLaVA-1.6-7B vs LLaVA-1.6-13B vs LLaVA-1.6-34B
  • Figure 5: The distribution of six categories of StripCipher.
  • ...and 6 more figures