ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Authors
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
Abstract
Multimodal reasoning requires iterative coordination between language and
vision, yet it remains unclear what constitutes a meaningful interleaved chain
of thought. We posit that text and image thoughts should function as
complementary rather than isomorphic modalities that mutually advance
reasoning. Guided by this principle, we build ThinkMorph, a unified model
fine-tuned on approximately 24K high-quality interleaved reasoning traces
spanning tasks with varying visual engagement. ThinkMorph learns to generate
progressive text-image reasoning steps that concretely manipulate visual
content while maintaining coherent verbal logic. It delivers large gains on
vision-centric benchmarks (averaging 34.7 percent over the base model) and
generalizes to out-of-domain tasks, matching or surpassing larger and
proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal
intelligence, including unseen visual manipulation skills, adaptive switching
between reasoning modes, and better test-time scaling through diversified
multimodal thoughts. These findings suggest promising directions for
characterizing the emergent capabilities of unified models for multimodal
reasoning.