OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution

Chong Xia; Fangfu Liu; Yule Wang; Yize Pang; Yueqi Duan

TL;DR

OnlineX is a feed-forward framework that reconstructs both 3D visual appearance and language fields online, using only streaming images. It models visual appearance and language fields jointly and incorporates an implicit Gaussian fusion module to enhance reconstruction quality.

Abstract

Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm and lack the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in the online formulation is cumulative drift, which is rooted in a fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, and remains robust across input sequences of varying lengths while running at real-time inference speed.
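The split between an active and a stable memory state can be illustrated with a toy example. The sketch below is not the OnlineX architecture: the names `MemoryState`, `encode_frame`, `fuse_into_stable`, and `run_stream` are hypothetical, frames are reduced to plain feature vectors, and the fusion step is assumed to be a simple exponential moving average rather than the learned fusion the abstract describes. It only shows the general pattern in which one state is overwritten at every frame while the other absorbs it conservatively.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class MemoryState:
    active: Vector  # overwritten at every frame (local, high-frequency detail)
    stable: Vector  # updated conservatively (long-term global structure)


def encode_frame(frame: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a learned encoder: copies the frame features."""
    return [float(x) for x in frame]


def fuse_into_stable(stable: Vector, active: Vector, rate: float) -> Vector:
    """Assumed fusion rule: exponential moving average with step size `rate`."""
    return [(1.0 - rate) * s + rate * a for s, a in zip(stable, active)]


def run_stream(frames: Sequence[Sequence[float]], rate: float = 0.1) -> MemoryState:
    """Process frames one by one, as an online method would."""
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    state = MemoryState(active=[0.0] * dim, stable=[0.0] * dim)
    for frame in frames:
        # Active role: refresh fully from the newest frame.
        state.active = encode_frame(frame)
        # Stable role: absorb only a fraction of the active state.
        state.stable = fuse_into_stable(state.stable, state.active, rate)
    return state


if __name__ == "__main__":
    final = run_stream([[1.0, 2.0], [1.0, 2.0]], rate=0.5)
    print(final.active, final.stable)
```

With a small `rate`, a single noisy frame changes the stable state only slightly, while the active state always reflects the latest input. That is the trade-off between fidelity and stability that the decoupling is meant to resolve.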

Paper Structure (32 sections, 11 equations, 6 figures, 6 tables)

This paper contains 32 sections, 11 equations, 6 figures, 6 tables.

Introduction
Related Work
Generalizable 3D Reconstruction.
3D Scene Understanding.
3D Online Paradigm.
Method
Problem Formulation
Gaussian Primitive Representation.
Rendering Process.
Generalizable Online Reconstruction.
Relative Geometry Extractor
Encoder and Decoder.
Relative Prediction Heads.
Anchor State Director
Recurrent Modeling.
...and 17 more sections

Figures (6)

Figure 1: We introduce OnlineX, a framework for continuous and progressive 3D scene reconstruction from streaming images. Our core contribution is an active-to-stable state evolution paradigm, which effectively mitigates long-term drift by decoupling the processing of high-fidelity active local details from the maintenance of a stable global structure.
Figure 2: Overall architecture of OnlineX. Our framework features a two-stage, active-to-stable pipeline. First, the Relative Geometry Extractor processes consecutive frames to capture high-fidelity active relative information. The Anchor State Director then uses this local information to recurrently update its stable global state, yielding a globally consistent representation for the final output. The diagram illustrates this process for a single time step, which would be sequentially repeated for each frame in the input stream. Dashed lines represent information passed from the previous time step or carried over to the next.
Figure 3: Qualitative comparison for novel view synthesis on RE10K (top two rows) and ScanNet (bottom two rows). We adopt the 4-view setting for RE10K and 15-view setting for ScanNet.
Figure 4: Qualitative comparison for semantic segmentation on ScanNet. Here we showcase one scene with 15 input views. The masks predicted by our method cover more complete regions than those of other methods; for the "Wall" prompt, for example, our mask is even more complete than the GT mask.
Figure 5: Qualitative results for zero-shot generalization on DL3DV. Our model can easily transfer to out-of-distribution data.
...and 1 more figure
