EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
TL;DR
EA3D tackles online open-world 3D object extraction from streaming video by unifying knowledge extraction with online Gaussian representations. It combines vision-language models (VLMs) with multi-level vision foundation models (VFMs) to build a knowledge-integrated feature map, which is embedded into Gaussians and updated online using visual odometry and a recurrent joint optimization that fuses geometry with semantic knowledge. The approach performs 3D reconstruction and scene understanding simultaneously, without geometric or pose priors, and achieves robust performance across rendering, semantic and instance segmentation, 3D bounding box estimation, semantic occupancy estimation, and mesh generation. Experiments on LERF and ScanNet demonstrate strong online performance, efficiency, and resilience to sparse views, highlighting the method’s potential for real-time open-world perception and downstream manipulation tasks.
Abstract
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
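To make the idea of incrementally updating per-Gaussian features from streaming observations concrete, the following is a minimal sketch. It is not EA3D's update rule, which the abstract does not specify: it assumes a simple confidence-weighted running mean, and the names `GaussianFeature`, `fuse_observation`, and `process_stream`, as well as the per-frame dictionary layout, are hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GaussianFeature:
    """Hypothetical per-Gaussian semantic feature plus its accumulated weight."""
    feature: list[float]
    weight: float = 0.0


def fuse_observation(g: GaussianFeature, observed: list[float], obs_weight: float) -> None:
    """Fold one new observation into a Gaussian's feature.

    Assumption: a confidence-weighted running mean stands in for the
    paper's (unspecified here) online update strategy.
    """
    if obs_weight <= 0.0:
        return  # ignore observations that carry no confidence
    if len(observed) != len(g.feature):
        raise ValueError("feature dimension mismatch")
    total = g.weight + obs_weight
    g.feature = [
        (g.weight * old + obs_weight * new) / total
        for old, new in zip(g.feature, observed)
    ]
    g.weight = total


def process_stream(
    gaussians: dict[int, GaussianFeature],
    frames: list[dict[int, tuple[list[float], float]]],
) -> dict[int, GaussianFeature]:
    """Process frames in arrival order; each frame maps a Gaussian id to
    an (observed feature, confidence) pair. Unseen ids are created on the fly."""
    for frame in frames:
        for gid, (observed, obs_weight) in frame.items():
            g = gaussians.setdefault(gid, GaussianFeature([0.0] * len(observed)))
            fuse_observation(g, observed, obs_weight)
    return gaussians


# Toy stream: Gaussian 0 is seen in both frames, Gaussian 1 only in the second.
frames = [
    {0: ([1.0, 0.0], 1.0)},
    {0: ([0.0, 1.0], 1.0), 1: ([2.0, 2.0], 0.5)},
]
state = process_stream({}, frames)
```

In this toy stream, Gaussian 0 ends up with the average of its two equally weighted observations, while Gaussian 1 keeps its single observation. A running mean is chosen here only because it needs no stored history, which matches the online setting; the actual framework additionally estimates visual odometry and jointly optimizes geometry and semantics, neither of which this sketch attempts.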
