Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
Liyuan Zhu, Shengyu Huang, Konrad Schindler, Iro Armeni
TL;DR
This work addresses long-term dynamic 3D scene understanding from sparse temporal observations by proposing MoRE$^2$, a unified framework for "living scenes". It introduces a compact SE(3)-equivariant encoder-decoder network, built on Vector Neurons with a neural implicit DeepSDF decoder, that handles instance matching, registration, and reconstruction simultaneously, together with a joint optimization that accumulates per-instance point clouds over time. The network is trained on synthetic data and validated on the synthetic FlyingShape dataset and the real-world 3RScan dataset, where it achieves state-of-the-art end-to-end performance and improves results on the matching, registration, and reconstruction subtasks. As more scans become available, the framework progressively refines geometric completeness and pose accuracy, with potential applications in robotics, AR/VR, and digital twins.
Abstract
Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to long-term changes with sparse observations. We address this gap with MoRE$^2$, a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as "living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances, whose accuracy and completeness increase over time. At the core of our method lies an SE(3)-equivariant representation in a single encoder-decoder network, trained on synthetic data. This representation enables us to seamlessly tackle instance matching, registration, and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art results in both end-to-end performance and individual subtasks.
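
To make the notion of an equivariant representation concrete, the following is a minimal NumPy sketch of the property, not the authors' network. It assumes a single Vector-Neuron-style linear layer that mixes channels of 3D vector features, and it covers only the rotational part of SE(3); translation is not handled here. The function names (`vn_linear`, `invariant_signature`, `check_equivariance`) and the layer sizes are hypothetical and chosen for illustration. The sketch checks two facts: rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output, and the Gram matrix of the features does not change under rotation, which is the kind of pose-independent quantity that matching can rely on.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random proper rotation matrix (orthogonal, determinant +1)."""
    # QR decomposition of a Gaussian matrix yields an orthogonal factor q.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # flip column signs to remove QR sign ambiguity
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # turn a reflection into a rotation
    return q


def vn_linear(features, weight):
    """Vector-Neuron-style linear layer (illustrative assumption).

    features: (C_in, 3) array, one 3D vector per channel.
    weight:   (C_out, C_in) array that mixes channels only,
              so it never touches the 3D coordinate axis.
    """
    return weight @ features


def invariant_signature(features):
    """Gram matrix of the channel vectors; unchanged by any rotation."""
    return features @ features.T


def check_equivariance(seed=0):
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(8, 3))
    weight = rng.normal(size=(4, 8))
    rot = random_rotation(rng)

    # Rotate each row vector v as rot @ v, i.e. features @ rot.T.
    layer_then_rotate = vn_linear(features, weight) @ rot.T
    rotate_then_layer = vn_linear(features @ rot.T, weight)
    equivariant = np.allclose(layer_then_rotate, rotate_then_layer)

    invariant = np.allclose(
        invariant_signature(features),
        invariant_signature(features @ rot.T),
    )
    return equivariant, invariant


if __name__ == "__main__":
    print(check_equivariance())
```

Because the weight matrix acts only on the channel axis, the rotation acting on the coordinate axis commutes with it; this is the design choice that lets a single representation carry the pose (equivariant part) and a pose-independent description (invariant part) at the same time.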
