COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu

TL;DR

COOPER is a unified multimodal LLM that jointly learns to perceive and reason about 3D spatial relations by generating auxiliary modalities (depth and segmentation) and applying adaptive, interleaved reasoning. Training has two stages: auxiliary modality generation, followed by SFT and CPR-guided RL. The full model improves spatial reasoning by 6.91% on average while preserving general multimodal performance, and a variant trained only for auxiliary modality generation still gains 7.92% on distance and size estimation. The approach demonstrates the value of intrinsic multimodal CoT and adaptive modality usage, outperforming perception-only and reasoning-only baselines across spatial benchmarks. These results suggest practical potential for robust 3D-aware reasoning in robotics, autonomous systems, and AR/VR applications, while pointing to future work in longer-horizon video reasoning and richer auxiliary modalities.

Abstract

Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

Paper Structure

This paper contains 39 sections, 9 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Comparison of three paradigms. (a) Query input: the visual input and its corresponding textual information. (b) Perception enhancement: augment the model with auxiliary modalities (e.g., depth, segmentation). (c) COOPER: a single model endowed with both capabilities that adaptively schedules when to perceive and when to reason during execution. (d) Reasoning enhancement: strengthen spatial reasoning via textual chain-of-thought. (e) Self-generated Multimodal CoT: an interleaved vision–language CoT generated by the unified reasoner.
  • Figure 2: Method details. The method consists of two stages: (a) Auxiliary Modality Generation. To equip the model with the ability to generate different types of auxiliary modalities, we convert all auxiliary-modality data into the RGB space and train the model to generate these modalities using the original image generation training pipeline. (b) Adaptive Interleaved Reasoning. Building on the model with auxiliary modality generation capability, we construct a balanced dataset and first apply supervised fine-tuning (SFT) to endow the model with basic interleaved reasoning. We then further enhance its reasoning and generalization ability using the CPR reward and GRPO.
  • Figure 3: Reasoning Analysis. (a) COOPER adaptively selects its reasoning mode across tasks: for RD (Relative Distance) and SQA (Situational QA), it more often generates auxiliary multimodal signals, while for GR (Geometric Reasoning) it relies more on purely textual reasoning. (b) and (c) show how COOPER chooses to generate depth maps or highlight target objects in segmentation maps according to the task, thereby assisting its own reasoning. Additional reasoning and failure cases are provided in the supplementary materials.
  • Figure 4: Segmentation Cases. Qualitative comparison between COOPER's segmentation maps and the ground-truth segmentation maps.
  • Figure 5: Depth estimation cases. Qualitative comparison between COOPER’s depth maps and the Marigold depth maps.
  • ...and 12 more figures
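
The Figure 2 caption names GRPO as the reinforcement learning step. As a rough illustration only, the sketch below shows the group-relative advantage that GRPO is commonly described as using: each sampled response's reward is normalized against the other responses drawn for the same query. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example. The CPR reward itself is not defined in this summary, so plain placeholder numbers stand in for it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same query.

    Hypothetical helper: the name and eps value are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    # eps keeps the division finite when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four sampled responses (not values from the paper)
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value model is needed to provide a baseline.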