EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

TL;DR

EA3D tackles online open-world 3D object extraction from streaming video by unifying knowledge extraction with online Gaussian representations. It combines vision-language models (VLMs) and multi-level vision foundation models (VFMs) to create a knowledge-integrated feature map, which is embedded into Gaussians and updated online via visual odometry and a recurrent joint optimization that fuses geometry with semantic knowledge. The approach performs 3D reconstruction and scene understanding simultaneously, without geometric or pose priors, and supports rendering, semantic/instance segmentation, 3D bounding boxes, semantic occupancy, and mesh generation. Experiments on LERF and ScanNet demonstrate strong online performance, efficiency, and resilience to sparse views, highlighting the method’s potential for real-time open-world perception and downstream manipulation tasks.

Abstract

Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
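
The incremental update described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the class `OnlineFeatureMap`, the integer primitive ids, and the running-mean fusion rule are all assumptions made for illustration. The running mean stands in for the recurrent joint optimization, and visual odometry and the foundation-model encoders are omitted entirely. The sketch only shows the general pattern of fusing each new frame's per-primitive features into a persistent map without revisiting earlier frames.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class OnlineFeatureMap:
    """Toy stand-in for per-Gaussian features (hypothetical, not EA3D's API)."""
    features: Dict[int, List[float]] = field(default_factory=dict)
    counts: Dict[int, int] = field(default_factory=dict)

    def update(self, observations: Dict[int, List[float]]) -> None:
        """Fuse one frame's per-primitive features using a running mean."""
        for pid, feat in observations.items():
            if pid not in self.features:
                # First time this primitive is observed: store it as-is.
                self.features[pid] = list(feat)
                self.counts[pid] = 1
                continue
            n = self.counts[pid]
            old = self.features[pid]
            # Running mean over all observations seen so far (assumed rule).
            self.features[pid] = [(o * n + f) / (n + 1) for o, f in zip(old, feat)]
            self.counts[pid] = n + 1


def process_stream(frames: Iterable[Dict[int, List[float]]]) -> OnlineFeatureMap:
    """Consume frames one at a time; earlier frames are never re-read."""
    fmap = OnlineFeatureMap()
    for frame in frames:
        fmap.update(frame)
    return fmap


# Two synthetic "frames": primitive 0 is seen twice, primitive 1 once.
stream = [
    {0: [1.0, 0.0]},
    {0: [0.0, 1.0], 1: [2.0, 2.0]},
]
result = process_stream(stream)
print(result.features)
```

In this sketch the cost of processing a frame depends only on the primitives that frame observes, which is the property that makes an online update cheaper than re-optimizing over the full history.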

Paper Structure

This paper contains 22 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of ExtractAnything3D (EA3D), which enables online open-world 3D object extraction. Given a streaming video as input with unknown geometry, pose, or semantics, EA3D performs online and simultaneous scene interpretation and geometry reconstruction, enabling multi-task understanding and modeling of any 3D objects in the scene.
  • Figure 2: Framework of EA3D. Given a streaming video without poses or labels, EA3D first leverages VLMs to identify all potential objects and their physical attributes, while maintaining a dynamic semantic cache to track newly emerging categories. We then use multi-level VFMs to extract knowledge-integrated feature maps from each frame and embed them into Gaussian primitives in a feed-forward manner. We perform online visual odometry estimation, and incrementally reconstruct geometry and infer knowledge through our online feature Gaussians. A recurrent joint optimization fuses current observations with historical features to continuously update the Gaussians. EA3D supports a wide range of 3D perception tasks and shows strong potential for downstream applications.
  • Figure 3: Visualization of online Gaussians on ScanNet (Dai et al., 2017). EA3D processes streaming video, reconstructing the scene incrementally while interpreting it. Historical features guide fast reasoning about current semantics and geometry, while new observations recurrently resolve ambiguities and occlusions.
  • Figure 4: Comparison of visual quality and model efficiency with state-of-the-art methods. Left (a): Under the more challenging streaming setting without pose input, EA3D delivers high-quality 3D object reconstruction and rendering. Notably, our method avoids redundant Gaussian features through efficient online updates, enabling more precise and lightweight optimization. Right (b): EA3D strikes a balance between speed and quality, significantly reducing training time while maintaining high-performance scene understanding.
  • Figure I: Visualization of Semantic-aware splatting to 3D Bbox and Semantic Occupancy.
  • ...and 4 more figures