4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
TL;DR
The paper tackles region-level 4D understanding by introducing 4D-RGPT, a multimodal LLM whose 4D perception is distilled at training time from a frozen expert model via Perceptual 4D Distillation (P4D). P4D uses dual distillation (latent and explicit) to transfer low-level 4D signals and intermediate representations without increasing inference cost. To evaluate region-aware 4D reasoning, the authors propose R4D-Bench, a region-prompted 4D VQA benchmark spanning static and dynamic scenes with 1,517 QA pairs. On both non-region 4D benchmarks and R4D-Bench, 4D-RGPT achieves state-of-the-art or competitive results among open-source models, and ablations confirm the contributions of P4D, Timestamp Positional Encoding, and diverse training data. Together, these contributions advance practical region-grounded 4D reasoning for real-world applications such as autonomous driving and industrial inspection.
Abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench.
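
To make the idea of distilling from a frozen expert more concrete, the following is a minimal sketch of what a two-term (latent plus explicit) distillation objective could look like. It is not the authors' implementation: the function names, the choice of mean squared error for latent features and mean absolute error for explicit signals such as depth, and the weights `w_latent` and `w_explicit` are all assumptions made for illustration.

```python
import numpy as np

def latent_distill_loss(student_feat, teacher_feat):
    """Mean squared error between student latents and frozen-teacher latents.
    (Assumed loss form; the paper's actual choice may differ.)"""
    return float(np.mean((student_feat - teacher_feat) ** 2))

def explicit_distill_loss(student_map, teacher_map):
    """Mean absolute error between explicit low-level predictions,
    e.g. per-pixel depth maps. (Assumed loss form.)"""
    return float(np.mean(np.abs(student_map - teacher_map)))

def dual_distill_loss(student_feat, teacher_feat, student_map, teacher_map,
                      w_latent=1.0, w_explicit=1.0):
    """Weighted sum of the two terms. The weights are hypothetical
    hyperparameters, not values from the paper."""
    return (w_latent * latent_distill_loss(student_feat, teacher_feat)
            + w_explicit * explicit_distill_loss(student_map, teacher_map))

# Toy example: 2 tokens with 3-dim latents, and a 2x2 "depth map".
student_feat = np.zeros((2, 3))
teacher_feat = np.ones((2, 3))       # teacher outputs are treated as fixed targets
student_map = np.zeros((2, 2))
teacher_map = np.full((2, 2), 2.0)

loss = dual_distill_loss(student_feat, teacher_feat, student_map, teacher_map,
                         w_latent=1.0, w_explicit=0.5)
print(loss)  # 1.0 * 1.0 + 0.5 * 2.0 = 2.0
```

Because both terms only compare student outputs against fixed teacher targets during training, a loss of this shape adds nothing to the model's forward pass at inference time, which matches the training-only role the text describes for P4D.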
