SceneFactory: A Workflow-centric and Unified Framework for Incremental Scene Modeling
Yijun Yuan, Michael Bleier, Andreas Nüchter
TL;DR
SceneFactory presents a unified, modular framework for incremental scene modeling that links tracking, depth estimation, and reconstruction in a dependency-driven workflow. It introduces four building blocks (tracking, flexion, depth estimation, reconstruction) and two novel components (DM-NPs for Surface Light Fields and IPR for fast surface querying) to support a wide range of inputs, including unposed/uncalibrated multi-view data and RGB-LiDAR streams. The depth module (U$^2$-MVD) combines dense correspondences, robust checks, and dense bundle adjustment (DBA) with a ScaleCov depth completion step, enabling both RGB-D and unposed multi-view depth estimation, while reconstruction uses online learning of DM-NPs for high-quality color and surface representations. The paper also provides a new RGB-X dense monocular SLAM dataset and demonstrates competitive performance against state-of-the-art methods on diverse benchmarks, highlighting the framework's flexibility, scalability, and potential for real-time, large-scale scene modeling. Overall, SceneFactory offers a practical, extensible pathway toward unified, production-line-like scene modeling across varied sensing modalities and tasks.
Abstract
We present SceneFactory, a workflow-centric and unified framework for incremental scene modeling that conveniently supports a wide range of applications, such as (unposed and/or uncalibrated) multi-view depth estimation, LiDAR completion, (dense) RGB-D/RGB-L/Mono/Depth-only reconstruction and SLAM. The workflow-centric design uses multiple blocks as the basis for constructing different production lines. The supported applications, i.e., productions, avoid redundancy in their designs. Thus, the focus is placed on each block itself for independent expansion. To support all input combinations, our implementation consists of four building blocks that form SceneFactory: (1) tracking, (2) flexion, (3) depth estimation, and (4) scene reconstruction. The tracking block is based on Mono SLAM and is extended to support RGB-D and RGB-LiDAR (RGB-L) inputs. Flexion is used to convert the depth image (untrackable) into a trackable image. For general-purpose depth estimation, we propose an unposed and uncalibrated multi-view depth estimation model (U$^2$-MVD) to estimate dense geometry. U$^2$-MVD exploits dense bundle adjustment to solve for poses, intrinsics, and inverse depth. A semantic-aware ScaleCov step is then introduced to complete the multi-view depth. Relying on U$^2$-MVD, SceneFactory both supports user-friendly 3D creation (with just images) and bridges the applications of Dense RGB-D and Dense Mono. For high-quality surface and color reconstruction, we propose Dual-purpose Multi-resolutional Neural Points (DM-NPs) for the first surface-accessible Surface Color Field design, where we introduce Improved Point Rasterization (IPR) for point-cloud-based surface query. ...
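The idea of assembling production lines from shared blocks can be illustrated with a small dependency-driven planner. This is only a sketch under stated assumptions, not the SceneFactory implementation: the `Block` class, the `plan_line` function, the data item names (`image`, `depth`, `sparse_depth`, `pose`, `scene`), and the `requires`/`provides` sets assigned to each block are all hypothetical simplifications of the four blocks named in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A workflow block: consumes some data items, produces others (hypothetical)."""
    name: str
    requires: frozenset
    provides: frozenset


# Assumed dependencies, loosely following the abstract; the real blocks differ.
BLOCKS = [
    Block("flexion", frozenset({"depth"}), frozenset({"image"})),
    Block("tracking", frozenset({"image"}), frozenset({"pose"})),
    Block("depth_estimation", frozenset({"image"}), frozenset({"depth"})),
    Block("reconstruction", frozenset({"depth", "pose"}), frozenset({"scene"})),
]


def plan_line(inputs, goal="scene"):
    """Chain blocks forward from the given inputs until the goal is produced."""
    available = set(inputs)
    pending = list(BLOCKS)
    line = []
    while goal not in available:
        # Pick the first block that can run and would add something new.
        ready = next(
            (b for b in pending
             if b.requires <= available and not b.provides <= available),
            None,
        )
        if ready is None:
            raise ValueError(f"cannot produce {goal!r} from {sorted(inputs)}")
        line.append(ready.name)
        available |= ready.provides
        pending.remove(ready)
    return line


if __name__ == "__main__":
    for label, inputs in [
        ("RGB-D", {"image", "depth"}),
        ("Mono", {"image"}),
        ("Depth-only", {"depth"}),
        ("RGB-L", {"image", "sparse_depth"}),
    ]:
        print(label, "->", plan_line(inputs))
```

In this toy model, a block is skipped whenever its outputs are already available, so an RGB-D input needs neither flexion nor depth estimation, while a depth-only input goes through flexion first to obtain a trackable image. New applications then come from new input sets rather than from new pipelines, which is the redundancy argument made above.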
