SMORE: Simultaneous Map and Object REconstruction

Nathaniel Chodosh, Anish Madan, Simon Lucey, Deva Ramanan

TL;DR

SMORE tackles dynamic scene reconstruction in large-scale urban LiDAR data by decomposing the scene into rigidly moving objects and a static background and jointly optimizing geometry and motion. It minimizes a global 3D point-to-surface objective by coordinate descent, alternating between registration and surface reconstruction steps that off-the-shelf methods can handle without retraining. A rolling-shutter deskewing model, extended from the ego-vehicle to dynamic actors, is key to reconstruction accuracy. The method achieves order-of-magnitude improvements over prior art in LiDAR novel-view synthesis and demonstrates robust ego- and actor-pose estimation. Practical applications include auto-labeling partially annotated sequences and producing ground truth for hard-to-label tasks such as depth completion and scene flow, yielding high-fidelity, time-consistent reconstructions for downstream perception in autonomous driving.

Abstract

We present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly moving objects and the background. To achieve this, we take inspiration from recent novel view synthesis methods and frame the reconstruction problem as a global optimization over neural surfaces, ego poses, and object poses, which minimizes the error between composed spacetime surfaces and input LiDAR scans. In contrast to view synthesis methods, which typically minimize 2D errors with gradient descent, we minimize a 3D point-to-surface error by coordinate descent, which we decompose into registration and surface reconstruction steps. Each step can be handled well by off-the-shelf methods without any re-training. We analyze the surface reconstruction step for rolling-shutter LiDARs, and show that deskewing operations common in continuous-time SLAM can be applied to dynamic objects as well, improving results over prior art by an order of magnitude. Beyond pursuing dynamic reconstruction as a goal in and of itself, we propose that such a system can be used to auto-label partially annotated sequences and produce ground-truth annotations for hard-to-label problems such as depth completion and scene flow. Please see https://anishmadan23.github.io/smore/ for more visual results.
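
The alternation between registration and surface reconstruction described in the abstract can be illustrated with a deliberately tiny toy problem. The sketch below is not the authors' implementation: it assumes a 2D world whose "surface" is a line z = a*x + b, scans whose only unknown pose parameter is a vertical offset, and a vertical (rather than perpendicular) point-to-surface residual. All function names and constants are hypothetical.

```python
def fit_line(points):
    """Surface reconstruction step: least-squares fit of z = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_z = sum(z for _, z in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xz = sum((x - mean_x) * (z - mean_z) for x, z in points)
    a = cov_xz / var_x
    return a, mean_z - a * mean_x


def register_scan(scan, a, b):
    """Registration step: vertical offset that best aligns one scan to the surface."""
    return sum(a * x + b - z for x, z in scan) / len(scan)


def surface_error(scans, offsets, a, b):
    """Sum of squared vertical point-to-surface residuals over all scans."""
    return sum(
        (z + t - (a * x + b)) ** 2
        for scan, t in zip(scans, offsets)
        for x, z in scan
    )


def coordinate_descent(scans, iterations=100):
    """Alternate between updating the surface and updating the scan offsets."""
    offsets = [0.0] * len(scans)
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Hold the poses fixed and re-fit the surface to the corrected points.
        corrected = [(x, z + t) for scan, t in zip(scans, offsets) for x, z in scan]
        a, b = fit_line(corrected)
        # Hold the surface fixed and re-register every scan except scan 0,
        # which stays at zero offset to serve as the reference frame.
        offsets = [0.0] + [register_scan(scan, a, b) for scan in scans[1:]]
    return a, b, offsets


# Synthetic, noise-free data (assumed values): one line seen by three misaligned scans.
TRUE_A, TRUE_B = 0.5, 2.0
TRUE_OFFSETS = [0.0, 0.3, -0.2]
XS = [0.0, 1.0, 2.0, 3.0]
scans = [[(x, TRUE_A * x + TRUE_B - t) for x in XS] for t in TRUE_OFFSETS]

# Baseline: fit the surface once with all offsets left at zero.
a0, b0 = fit_line([p for scan in scans for p in scan])
initial_error = surface_error(scans, [0.0] * len(scans), a0, b0)

a, b, offsets = coordinate_descent(scans)
final_error = surface_error(scans, offsets, a, b)
print(f"initial error {initial_error:.4f}, final error {final_error:.2e}")
print(f"recovered surface a={a:.4f}, b={b:.4f}, offsets={offsets}")
```

Each sub-step has a closed-form solution here; in the setting the abstract describes, those steps would instead be handled by off-the-shelf registration and surface reconstruction methods.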

Paper Structure (18 sections, 8 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: NuScenes surface reconstruction produced by aggregating LiDAR scans using human-annotated ego-pose and dynamic object bounding boxes (left). We introduce a global optimization that refines both ego and object poses so as to minimize a scan-to-surface reconstruction error, dramatically improving results (right). To do so, we find it crucial to model rolling shutter LiDAR effects, particularly for dynamic objects (see the ablation figure). Please see the animation in the supplement.
  • Figure 2: Dynamic object reconstructions using human-annotated bounding-box annotations (top left) tend to be noisy. Optimizing over object pose (top right) improves accuracy, while de-skewing scans to account for dynamic object motion is even more helpful (bottom). Please see the animation in the supplement.
  • Figure 3: Depth maps produced by our method (left) as compared to those from a SOTA NeRF-based method, NeuRAD [tonderski2024neurad] (right). In the top row we see that when depth maps are close to the training views, NeuRAD and our method produce comparable results. However, in the bottom row, we see that moving the camera far from the training poses reveals large errors in the density field not present in our surface-based method.
  • Figure 4: (Left) A LiDAR sweep where each point has been colored according to which laser it belongs to (hue) and the time within the sweep it was acquired (lighter is earlier, darker is later). A moving car is passing the ego-vehicle (traveling right) on the left and is captured at both the start and end of the sweep (top right), leading to distortion (the driver-side window is captured twice in different locations). Accounting for this distortion by modeling the object motion is key to the quality of our reconstructions (bottom right).
  • Figure 5: LiDAR is often abstracted as 360-degree sweeps captured with a global shutter, but is actually captured with a continuous rotating shutter from a moving ego-car. Our continuous-time optimization framework correctly models this, dramatically improving the quality of urban reconstructions. Here, we visualize the set of rays captured at a time instant (blue lines) for NuScenes (top) and Argoverse 2 [Argoverse2] (bottom). Interestingly, our approach is even more effective for recent AV datasets [Argoverse2, sun2020scalability] that employ multiple spinning lidars, which are often set to be out-of-phase to minimize interference (but adding to the inconsistency of a global shutter approximation).
  • ...and 3 more figures
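
The rolling-shutter distortion described in the captions for Figures 4 and 5 can be illustrated with a minimal deskewing sketch. It is not the paper's model: it assumes a 2D world, an ego-vehicle with constant velocity and constant yaw rate over one sweep, and a dynamic object that only translates at constant velocity. The function names and all numeric values are hypothetical.

```python
import math


def rotate(point, angle):
    """Rotate a 2D point counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def measure(world_point, t, ego_velocity, ego_yaw_rate):
    """Simulate the sensor-frame return of a world point observed at time t."""
    dx = world_point[0] - ego_velocity[0] * t
    dy = world_point[1] - ego_velocity[1] * t
    return rotate((dx, dy), -ego_yaw_rate * t)


def deskew(point, t, ego_velocity, ego_yaw_rate, object_velocity=(0.0, 0.0)):
    """Map a point measured at time t into the sweep-start frame (t = 0)."""
    # Undo ego motion: sensor frame at time t -> sweep-start frame.
    wx, wy = rotate(point, ego_yaw_rate * t)
    wx += ego_velocity[0] * t
    wy += ego_velocity[1] * t
    # Undo object motion: move the point back to where the object was at t = 0.
    return (wx - object_velocity[0] * t, wy - object_velocity[1] * t)


# Assumed sweep: 0.1 s long, ego driving at 10 m/s while turning at 0.2 rad/s.
EGO_VELOCITY = (10.0, 0.0)
EGO_YAW_RATE = 0.2
TIMES = [0.0, 0.05, 0.1]

# A static landmark observed three times during the sweep.
LANDMARK = (10.0, 5.0)
static_deskewed = [
    deskew(measure(LANDMARK, t, EGO_VELOCITY, EGO_YAW_RATE), t, EGO_VELOCITY, EGO_YAW_RATE)
    for t in TIMES
]

# A point on a passing car that moves at 12 m/s during the same sweep.
CAR_START = (5.0, 3.0)
CAR_VELOCITY = (12.0, 0.0)
car_returns = [
    measure((CAR_START[0] + CAR_VELOCITY[0] * t, CAR_START[1] + CAR_VELOCITY[1] * t),
            t, EGO_VELOCITY, EGO_YAW_RATE)
    for t in TIMES
]
# Ego-only deskewing leaves the car smeared along its direction of travel.
car_ego_only = [
    deskew(p, t, EGO_VELOCITY, EGO_YAW_RATE) for p, t in zip(car_returns, TIMES)
]
# Deskewing with the object's motion collapses the returns onto one point.
car_deskewed = [
    deskew(p, t, EGO_VELOCITY, EGO_YAW_RATE, CAR_VELOCITY)
    for p, t in zip(car_returns, TIMES)
]

print("static landmark:", static_deskewed)
print("car, ego-only deskew:", car_ego_only)
print("car, ego + object deskew:", car_deskewed)
```

In this toy setup, compensating only for ego motion is enough for static structure, while the moving car stays spread out until its own motion is also removed, which mirrors the double-captured window described for Figure 4.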