DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling, Taiping Zeng, Xiangyang Xue, Jingbo Zhang

TL;DR

DetAny4D tackles the challenge of reliable 4D object detection in streaming RGB video by introducing the DA4D dataset and an end-to-end open-set framework that predicts globally consistent 3D b-boxes across time. The model fuses multi-modal vision priors from SAM and DINOv2 with depth and camera cues through a geometry-aware Spatiotemporal Decoder and multi-task heads, trained with a sequence-aware strategy and specialized losses to enforce spatial and temporal consistency. Key contributions include the large-scale, spatiotemporally annotated DA4D dataset; the DetAny4D end-to-end 4D detector; and a training regime plus loss design that significantly reduces cross-frame variance while maintaining competitive 3D accuracy and open-set capability. Together, these advances enable robust long-term perception for streaming scenes and pave the way for scalable 4D perception in real-world systems.

Abstract

Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

Paper Structure

This paper contains 26 sections, 9 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Comparison with existing methods. Existing 3D detectors predict on a frame-by-frame basis, which causes inconsistency when transforming 3D b-boxes into global coordinates. Current open-set 4D detectors typically address 3D predictions and cross-frame relationships in a multi-stage manner, which is complex and prone to error propagation across cascaded stages. In contrast, we propose an open-set end-to-end 4D detection benchmark that directly predicts globally consistent 3D b-boxes.
  • Figure 2: The data processing pipeline for the 4D detection task. We record posed RGB frames sequentially and split the recordings into fixed-length sequences. Objects in global coordinates are projected into the ego view and filtered with policies that remove occluded and out-of-view objects. Object b-boxes are then recalculated according to visibility and accumulated considering the point cloud within the sequence. Finally, the coordinates of each sequence are re-expressed relative to its first frame.
  • Figure 3: Pipeline of the proposed DetAny4D model. An RGB sequence with prompts is encoded by the feature extractor, generating tokens $T^t$, image embeddings $E_{img}^t$, and depth- and camera-related embeddings $E_{d,m,c}^t$. A Geometry Context Transformer then injects 3D space embeddings in a transformer control manner; these, together with the embeddings decoded by the Spatiotemporal Transformer, generate the prediction results. Multi-task heads are employed for effective training.
  • Figure 4: Visualization of the GT b-box adaptation strategy (Section \ref{sec:method:box_adaptation}). Taking an L-shaped sofa as an example, the red b-box denotes the original GT and the green b-box denotes the adapted one. (a) shows the adaptation of the b-box as the camera moves from left to right (column 1) and from right to left (column 2) while the sofa is only partially observed. (b) shows that, when the sofa is fully observed, the adapted b-box aligns with the original GT.
  • Figure 5: Qualitative comparison with other methods on 3D b-box predictions across consecutive frames in a sequence. Our proposed DetAny4D predicts spatiotemporally aligned 3D b-boxes, while red circles and rectangles mark inaccurate predictions and inter-frame jitter.
  • ...and 16 more figures
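
The captions of Figures 1 and 2 refer to two geometric operations: transforming per-frame 3D b-boxes into a shared coordinate frame, and expressing a sequence's coordinates relative to its first frame. The sketch below illustrates only that geometry and is not the authors' implementation. It assumes camera-to-world poses given as 4x4 rigid transforms and tracks box centers only. The helper names (`pose`, `to_first_frame`, `cross_frame_variance`) and the variance measure itself are hypothetical choices made for this example; the paper's own stability metric may differ.

```python
import math

def pose(yaw, tx, ty, tz):
    """Hypothetical camera-to-world pose: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx], [s, c, 0.0, ty], [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def invert_rigid(T):
    """Invert a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * T[k][3] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def apply(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return [sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3)]

def to_first_frame(world_from_cam, centers_cam):
    """Re-express per-frame box centers (camera coordinates) in the first frame's camera coordinates."""
    first_from_world = invert_rigid(world_from_cam[0])
    return [apply(matmul(first_from_world, T), c)
            for T, c in zip(world_from_cam, centers_cam)]

def cross_frame_variance(points):
    """Mean squared distance of the aligned centers from their mean (an assumed stability measure)."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return sum(sum((p[i] - mean[i]) ** 2 for i in range(3)) for p in points) / n

# A static object observed from three camera poses (all values are made up).
poses = [pose(0.0, 0.0, 0.0, 0.0), pose(0.3, 1.0, 0.5, 0.0), pose(-0.2, 2.0, -0.5, 0.1)]
obj_world = [4.0, 1.0, 0.5]
centers_cam = [apply(invert_rigid(T), obj_world) for T in poses]

# Perfect per-frame predictions collapse to one point after alignment.
aligned = to_first_frame(poses, centers_cam)
stable = cross_frame_variance(aligned)

# Perturbing a single frame's prediction shows up as cross-frame variance.
noisy = [list(c) for c in centers_cam]
noisy[1][0] += 0.2
jittery = cross_frame_variance(to_first_frame(poses, noisy))
```

For a static object, consistent per-frame predictions yield a variance near zero after alignment, while an error in any single frame raises it. This is the kind of inconsistency the Figure 1 caption attributes to frame-by-frame detectors.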