4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen

TL;DR

The paper tackles region-level 4D understanding by introducing 4D-RGPT, a multimodal LLM whose 4D perception is distilled at training time from a frozen expert model via Perceptual 4D Distillation (P4D). P4D uses dual distillation: latent distillation transfers the expert's intermediate 4D representations, and explicit distillation transfers its low-level 4D signals such as depth and optical flow. Because the distillation modules are used only during training, inference cost does not increase. To evaluate region-aware 4D reasoning, the authors propose R4D-Bench, a region-prompted 4D VQA benchmark spanning static and dynamic scenes with 1,517 QA pairs. Across non-region 4D benchmarks and R4D-Bench, 4D-RGPT achieves state-of-the-art or competitive results among open-source models, with ablations confirming the effectiveness of P4D, Timestamp Positional Encoding, and diverse training data. Together, these contributions advance practical region-grounded 4D reasoning for real-world applications such as autonomous driving and industrial inspection.

Abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Paper Structure

This paper contains 23 sections, 8 equations, 26 figures, 11 tables.

Figures (26)

  • Figure 1: Overview of Region-level 4D Understanding. 4D region-level VQA, e.g., our R4D-Bench, requires MLLMs to be able to track regions (2D), perceive depth (3D), and temporal progression (4D). Baseline MLLMs cannot recognize one or more of these aspects and thus fail to answer questions correctly. With our distillation framework, our 4D-RGPT better perceives these aspects and answers accurately. We note that the regions labeled with (*) are not provided in R4D-Bench; they are visualized for readability.
  • Figure 2: Perceptual 4D Distillation (P4D) framework for 4D-RGPT. For each frame ${\bm{I}}^{(i)}$ in ${\bm{V}}$, 4D-RGPT extracts 4D representations through training-only modules, i.e., ${\bm{\mathsfit{D}}}_{\tt 4DP}$ and ${\bm{\mathsfit{D}}}_m$ for $m \in {\mathcal{M}}$. This includes both latent features, i.e., $\hat{{\bm{F}}}_{\tt 4D}$, and explicit signals, e.g., depth $\hat{{\bm{P}}}_{\tt depth}$ or optical flow maps $\hat{{\bm{P}}}_{\tt flow}$. We also incorporate timestamp positional encodings (TPE) to provide temporal cues for 4D-RGPT to be temporally aware. In the P4D framework, the frozen teacher, i.e., 4D perception model, captures 4D expert knowledge from ${\bm{V}}$. It is then distilled to the student 4D-RGPT via two strategies. (a) Latent Distillation (LD): We align the latent $\hat{{\bm{F}}}_{\tt 4D}$ with the teacher's intermediate 4D embeddings ${\bm{F}}_{\tt 4D}$. (b) Explicit Distillation (ED): We align the explicit $\hat{{\bm{P}}}_{m}$ with the teacher's final 4D signals ${\bm{P}}_{m}$. 4D-RGPT is optimized end-to-end using both SFT loss and the distillation losses, i.e., ${\mathcal{L}}_{\tt LD}$ and ${\mathcal{L}}_{\tt ED}$.
  • Figure 3: Curation pipeline of our R4D-Bench. Given existing non-region 4D VQA benchmarks, we (a) first extract the noun keywords from the question as candidates for objects of interest. (b) Next, if ground truth segmentation masks are provided, we use them for step (d). Otherwise, we use off-the-shelf GroundingDINO liu2024groundingdino and SAM2 ravi2024sam2 to extract segmentation masks for each object of interest. (c) We generate a SoM yang2023som image for the first frame. (d) We prompt Qwen-2.5VL alibab2025qwen25vl with the SoM image and the processed question to match the objects referred to in the question with the regions. (e) Finally, the generated matching results are verified by human experts.
  • Figure 4: VQA comparison among baseline MLLMs and 4D-RGPT on R4D-Bench. For the baseline MLLMs, we use GPT-4o-20241120 openai2024gpt4o, Qwen-2.5VL-7B-Instruct alibab2025qwen25vl, and NVILA-Lite-8B liu2025nvila. We note that the regions labeled with (*) are not provided in R4D-Bench; they are visualized for readability.
  • Figure 5: Predicted depth maps at different training steps. We visualize the progress of $\hat{{\bm{P}}}_{\tt depth}$ throughout training.
  • ...and 21 more figures
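
The Figure 2 caption describes a training objective that combines an SFT loss with a latent distillation loss ${\mathcal{L}}_{\tt LD}$ and an explicit distillation loss ${\mathcal{L}}_{\tt ED}$. The sketch below illustrates how such a combined objective could be assembled. It is not the authors' implementation: the choice of mean squared error for the latent term, mean absolute error for the explicit term, the weights `w_ld` and `w_ed`, and all function names are assumptions made for illustration, and features are represented as flat lists of floats to keep the example dependency-free.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat vectors."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized flat vectors."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def p4d_total_loss(
    sft_loss: float,
    student_latent: Sequence[float],
    teacher_latent: Sequence[float],
    student_signals: Dict[str, Sequence[float]],
    teacher_signals: Dict[str, Sequence[float]],
    w_ld: float = 1.0,  # hypothetical weight, not taken from the paper
    w_ed: float = 1.0,  # hypothetical weight, not taken from the paper
) -> Dict[str, float]:
    """Combine SFT, latent distillation (LD), and explicit distillation (ED).

    LD aligns the student's latent 4D features with the frozen teacher's
    intermediate embeddings (assumed here to use MSE). ED aligns each
    explicit signal m, e.g. "depth" or "flow", with the teacher's final
    output for that modality (assumed here to use L1), summed over m.
    """
    loss_ld = mse(student_latent, teacher_latent)
    loss_ed = sum(
        l1(student_signals[m], teacher_signals[m]) for m in teacher_signals
    )
    total = sft_loss + w_ld * loss_ld + w_ed * loss_ed
    return {"sft": sft_loss, "ld": loss_ld, "ed": loss_ed, "total": total}


if __name__ == "__main__":
    losses = p4d_total_loss(
        sft_loss=1.0,
        student_latent=[0.0, 2.0],
        teacher_latent=[0.0, 0.0],
        student_signals={"depth": [1.0, 1.0], "flow": [0.0]},
        teacher_signals={"depth": [0.0, 1.0], "flow": [0.5]},
    )
    print(losses)
```

In this sketch the teacher values are plain inputs, which mirrors the frozen teacher in the caption: only the student side would receive gradients in a real training loop. The decoders that produce the student's latent and explicit outputs are described as training-only modules, so dropping them at inference leaves the deployed model unchanged.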