MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun

TL;DR

This paper introduces MLLM-4D, a comprehensive framework that bridges the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning in multimodal large language models (MLLMs).

Abstract

Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and spatiotemporal reward functions (ST-reward), without modifying the model architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.

Paper Structure (19 sections, 9 equations, 17 figures, 3 tables, 1 algorithm)

Figures (17)

  • Figure 1: We propose MLLM-4D, a method that advances MLLMs toward visual-based spatial-temporal intelligence. MLLM-4D is capable of understanding and reasoning about the evolution of 3D space over time from only 2D video input.
  • Figure 2: The components of our MLLM4D-Bench.
  • Figure 3: Our scalable curation pipeline for instructional spatiotemporal data. Our automated pipeline leverages several advanced vision techniques to extract 4D spatiotemporal information from stereoscopic videos, including per-frame camera poses, object-level 3D point clouds, and semantic descriptions. These data are then processed through a physics-based spatiotemporal relation solver to generate 4D QA pairs, and our specialized ST-CoT prompting strategy synthesizes the corresponding reasoning trajectories.
  • Figure 4: Our RFT pipeline. Given the input video and question, the MLLM-4D model generates multiple rollouts using the ST-CoT reasoning format. Within each group, relative advantages are computed based on accuracy reward, format reward and ST-reward. The model parameters are then updated via the GRPO objective, which incorporates a KL penalty relative to the frozen reference model.
  • Figure 5: Scalability with training data in the SFT and RFT stages.
  • ...and 12 more figures
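
The group-relative advantage step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the weighted-sum combination of the three rewards, the unit weights, the population standard deviation, and the example reward values are all assumptions made here for clarity. The clipped policy-ratio term and the KL penalty against the frozen reference model are omitted.

```python
from statistics import mean, pstdev


def total_reward(accuracy, fmt, st, weights=(1.0, 1.0, 1.0)):
    """Combine accuracy, format, and ST rewards into one scalar.

    A weighted sum with unit weights is an assumption; the source does not
    state how the three rewards are combined.
    """
    w_acc, w_fmt, w_st = weights
    return w_acc * accuracy + w_fmt * fmt + w_st * st


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Standard GRPO-style normalization: subtract the group mean and divide by
    the group standard deviation (population std assumed here). eps avoids
    division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for one (video, question) pair.
rollouts = [
    {"accuracy": 1.0, "format": 1.0, "st": 0.8},
    {"accuracy": 0.0, "format": 1.0, "st": 0.5},
    {"accuracy": 1.0, "format": 0.0, "st": 0.9},
    {"accuracy": 0.0, "format": 0.0, "st": 0.1},
]

rewards = [total_reward(r["accuracy"], r["format"], r["st"]) for r in rollouts]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Because advantages are computed relative to the group, a rollout is rewarded only for being better than its siblings on the same question, so no separate value model is needed to provide a baseline.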