COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
TL;DR
COOPER is a unified multimodal LLM that jointly learns to perceive and reason about 3D spatial relations by generating auxiliary modalities (depth and segmentation) and using them in adaptive, interleaved reasoning. Training proceeds in two stages: auxiliary modality generation, followed by SFT and CPR-guided RL. The resulting model improves spatial reasoning by 6.91% on average while preserving general multimodal performance, and a variant trained only for auxiliary modality generation still gains 7.92% on distance and size estimation. The approach demonstrates the value of intrinsic multimodal CoT and adaptive modality usage, outperforming perception-only and reasoning-only baselines across spatial benchmarks. These results suggest practical potential for robust 3D-aware reasoning in robotics, autonomous systems, and AR/VR applications, and point to future work on longer-horizon video reasoning and richer auxiliary modalities.
Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
