CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

Guanghao Zhang; Tao Zhong; Yan Xia; Mushui Liu; Zhelun Yu; Haoyuan Li; Wanggui He; Fangxun Shu; Dong She; Yi Wang; Hao Jiang

TL;DR

CMMCoT introduces a memory-augmented, slow-thinking framework for complex multi-image understanding, addressing limitations of end-to-end multimodal prediction. It combines interleaved multimodal sequence representations with a Retrieval-based Image Feature Reasoning Enhancement Module (RIFREM) to enable dynamic cross-image reasoning during inference. A new CMMCoT-260K dataset provides four reasoning task types (Caption, Co-reference, Comparison, Reason) to train and evaluate multi-image CoT. Across six benchmarks, including multi-image and single-image tasks, CMMCoT achieves state-of-the-art results and offers improved interpretability of intermediate reasoning steps. The work highlights the value of structured, memory-augmented reasoning for robust and explainable multi-modal AI systems.

Abstract

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, by contrast, typically perform two complementary cognitive operations when engaging in sophisticated multi-image analysis: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model. Code is available at https://github.com/zhangguanghao523/CMMCoT.
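To make the idea of test-time memory augmentation more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that per-image feature vectors are written to a memory bank and that a query vector (for example, one derived from a region of interest) reads from that bank via scaled dot-product attention. The class name `ImageMemory` and its methods `write` and `retrieve` are hypothetical and exist only for illustration; the actual module in CMMCoT may differ substantially.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class ImageMemory:
    """Hypothetical memory bank holding feature vectors for each input image."""

    features: dict[int, list[list[float]]] = field(default_factory=dict)

    def write(self, image_id: int, vectors: list[list[float]]) -> None:
        # Append feature vectors under the image they were extracted from.
        self.features.setdefault(image_id, []).extend(vectors)

    def retrieve(self, query: list[float]) -> list[float]:
        # Flatten the bank so the query can attend across all images at once.
        bank = [v for vecs in self.features.values() for v in vecs]
        if not bank:
            raise ValueError("memory is empty")

        # Scaled dot-product scores between the query and every stored vector.
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, v)) / scale for v in bank]

        # Numerically stable softmax weights.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)

        # Weighted average of the stored vectors (values equal keys here,
        # an assumption made to keep the sketch short).
        return [
            sum(w * v[i] for w, v in zip(weights, bank)) / total
            for i in range(len(query))
        ]


memory = ImageMemory()
memory.write(0, [[1.0, 0.0]])  # toy feature from image 0
memory.write(1, [[0.0, 1.0]])  # toy feature from image 1
retrieved = memory.retrieve([10.0, 0.0])  # query aligned with image 0
```

In this toy setup the query aligns with the feature from image 0, so the retrieved vector is dominated by that feature. Because the bank itself has no learned weights in this sketch, adding more images grows the memory without growing the parameter count, which is one way to read "preserving parameter efficiency".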
