DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling, Taiping Zeng, Xiangyang Xue, Jingbo Zhang
TL;DR
DetAny4D tackles reliable 4D object detection in streaming RGB video by introducing the DA4D dataset and an end-to-end open-set framework that predicts globally consistent 3D bounding boxes across time. The model fuses vision priors from SAM and DINOv2 with depth and camera cues through a geometry-aware Spatiotemporal Decoder and multi-task heads, and is trained with a sequence-aware strategy and specialized losses that enforce spatial and temporal consistency. Key contributions include the large-scale, spatiotemporally annotated DA4D dataset; the DetAny4D end-to-end 4D detector; and a training regime and loss design that significantly reduce cross-frame variance while maintaining competitive 3D accuracy and open-set capability. Together, these contributions support robust long-term perception in streaming scenes and scalable 4D perception in real-world systems.
Abstract
Reliable 4D object detection, that is, 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically either make predictions frame by frame without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets with continuous, reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundation models and uses a geometry-aware spatiotemporal decoder to capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.
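To make the notion of temporal stability concrete, the following is a minimal sketch of one way jitter could be quantified for a single static object: per-frame b-box centers are mapped into a shared world frame and their variance over time is measured. This is an illustration under stated assumptions, not the evaluation protocol of DetAny4D; the function names `to_world` and `cross_frame_variance`, the availability of camera-to-world poses, and the toy data are all hypothetical.

```python
import numpy as np


def to_world(centers_cam, poses_c2w):
    """Map per-frame box centers from camera to world coordinates.

    centers_cam: (T, 3) box centers, each in its own frame's camera coordinates.
    poses_c2w:   (T, 4, 4) camera-to-world transforms (assumed to be known here).
    """
    ones = np.ones((centers_cam.shape[0], 1))
    homog = np.concatenate([centers_cam, ones], axis=1)   # (T, 4) homogeneous points
    world = np.einsum("tij,tj->ti", poses_c2w, homog)      # apply each frame's pose
    return world[:, :3]


def cross_frame_variance(values):
    """Mean per-dimension variance over the time axis (0 means perfectly stable)."""
    return float(np.mean(np.var(values, axis=0)))


# Toy data (hypothetical): a static object at world position (1, 2, 5),
# observed by a camera that translates along x with identity rotation.
T = 5
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.arange(T) * 0.1
true_world = np.array([1.0, 2.0, 5.0])
centers_cam = true_world - poses[:, :3, 3]   # identity rotation: cam = world - t

stable = to_world(centers_cam, poses)        # consistent predictions
rng = np.random.default_rng(0)
jittery = stable + rng.normal(scale=0.05, size=stable.shape)  # noisy per-frame predictions

print("stable :", cross_frame_variance(stable))
print("jittery:", cross_frame_variance(jittery))
```

For a static object, a globally consistent detector should yield near-zero variance in the world frame, whereas independent frame-by-frame predictions tend to scatter around the true position. The same measure could be applied to box dimensions or orientation.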
