3D StreetUnveiler with Semantic-aware 2DGS -- a simple baseline

Jingwei Xu, Yikai Wang, Yiqun Zhao, Yanwei Fu, Shenghua Gao

TL;DR

StreetUnveiler reconstructs an empty street from in-car video by learning a scalable background representation with hard-label semantic 2DGS and using a rendered alpha map to identify unobservable regions. A novel time-reversal inpainting framework, guided by reference-based diffusion (LeftRefill), enforces temporal consistency across long vehicle trajectories, while pseudo-labels from inpainted frames are used to re-optimize the 2DGS. The method introduces semantic distortion and shrinking losses to stabilize semantic fields and prune unseen Gaussians, enabling robust object removal and clean background geometry. Experiments on Waymo Open Perception and Pandaset show that StreetUnveiler achieves improved cross-view consistency and competitive efficiency, with the ability to extract a complete empty-street mesh for downstream tasks.

Abstract

Unveiling an empty street from crowded observations captured by in-car cameras is crucial for autonomous driving. However, removing all temporarily static objects, such as stopped vehicles and standing pedestrians, presents a significant challenge. Unlike object-centric 3D inpainting, which relies on thorough observation of a small scene, street scenes involve long trajectories that previous 3D inpainting tasks do not address. The camera-centric moving environment of the captured videos further complicates the task, because each object is observed from a limited range of angles and for a limited duration. To address these obstacles, we introduce StreetUnveiler to reconstruct an empty street. StreetUnveiler learns a 3D representation of the empty street from crowded observations. Our representation is based on hard-label semantic 2D Gaussian Splatting (2DGS), chosen for its scalability and its ability to identify the Gaussians to be removed. We inpaint the rendered images after removing unwanted Gaussians to provide pseudo-labels, and subsequently re-optimize the 2DGS. Given the temporally continuous movement of the in-car camera, we divide the empty street scene into observed, partially observed, and unobserved regions, which we propose to locate through a rendered alpha map. This decomposition helps us minimize the regions that need to be inpainted. To enhance the temporal consistency of the inpainting, we introduce a novel time-reversal framework that inpaints frames in reverse order and uses later frames as references for earlier frames, fully utilizing the long-trajectory observations. Experiments on street scene datasets show that our method successfully reconstructs a 3D representation of the empty street. The mesh representation of the empty street can be extracted for further applications. The project page and more visualizations can be found at: https://streetunveiler.github.io
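
The reverse-order scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `inpaint_unconditional` and `inpaint_with_reference` are hypothetical callables standing in for an unconditional inpainter and a reference-based inpainter, and the frame type is left generic. The sketch only shows the ordering, where each earlier frame is refined using the already-inpainted later frame as its reference.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")


def time_reversal_inpaint(
    frames: List[Frame],
    inpaint_unconditional: Callable[[Frame], Frame],      # hypothetical stand-in
    inpaint_with_reference: Callable[[Frame, Frame], Frame],  # hypothetical stand-in
) -> List[Frame]:
    """Inpaint frames from last to first, conditioning each on the next one."""
    if not frames:
        return []

    results: List[Frame] = list(frames)

    # The last frame has no later frame to refer to, so it is only
    # inpainted unconditionally.
    results[-1] = inpaint_unconditional(frames[-1])

    # Walk backwards in time: frame i uses inpainted frame i + 1 as reference.
    for i in range(len(frames) - 2, -1, -1):
        initial = inpaint_unconditional(frames[i])
        results[i] = inpaint_with_reference(initial, results[i + 1])

    return results
```

With string placeholders for frames, for example `time_reversal_inpaint(["f0", "f1"], lambda f: f + "*", lambda f, r: f + "<-" + r)`, the chain of references from later to earlier frames becomes visible in the output.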

Paper Structure (36 sections, 11 equations, 25 figures, 7 tables)

Figures (25)

  • Figure 1: We achieve accurate empty street reconstruction from in-car camera videos. With the aid of the proposed hard-label semantic 2D Gaussian Splatting and time-reversal inpainting framework, we remove the unwanted objects with satisfactory appearance and geometry of occluded regions.
  • Figure 2: (a) The mask of the whole unwanted object; (b) Inpainting with the mask from (a); (c) The inpainting mask generated from a rendered alpha map, where pixels with a low alpha value are selected for inpainting; (d) Inpainting with the mask from (c).
  • Figure 3: Illustration of reference-based inpainting of two views. Left: When we inpaint the near view with the far view as a reference, the consistency of the inpainting result degrades, and there are fewer matching pixels between the reference far-view image and the near-view inpainting result; Right: Inpainting the far view using the near view as a reference results in better quality and more accurate pixel matching. It is easier to generate low-resolution content with a high-resolution image as a reference.
  • Figure 4: Illustration of time-reversal inpainting. After we remove the Gaussians of the objects, we first unconditionally inpaint both frames $T_n$ and $T_{n+1}$ with ZITS++ [cao2023zits++]. Then we propagate the pixels from frame $T_{n+1}$ to frame $T_{n}$ through reference-based inpainting with LeftRefill [cao2024leftrefill]. At a high level, we inpaint the earlier frame $T_{n}$ with the later frame $T_{n+1}$ as a reference condition.
  • Figure 5: Qualitative comparison results of our method. Our method achieves clearer results than temporally consistent inpainting baselines. Video comparisons will be placed in the supplementary.
  • ...and 20 more figures
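
The alpha-map mask generation described in the Figure 2 caption can be sketched as a simple threshold. This is only an illustration under stated assumptions: the threshold value `0.5` and the function name `alpha_inpainting_mask` are hypothetical, and the alpha map is represented as a nested list of floats in [0, 1] rather than a real rendered image.

```python
from typing import List


def alpha_inpainting_mask(
    alpha: List[List[float]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[List[bool]]:
    """Mark pixels whose rendered alpha falls below the threshold.

    After the unwanted Gaussians are removed, a low alpha value suggests
    that no remaining background Gaussian covers the pixel, so the pixel
    is selected for inpainting.
    """
    return [[a < threshold for a in row] for row in alpha]


# Toy 2x3 alpha map rendered after object removal.
alpha_map = [
    [1.0, 0.9, 0.2],
    [0.8, 0.1, 0.0],
]
mask = alpha_inpainting_mask(alpha_map)
```

In this toy example only three of the six pixels are selected, which reflects the intent in the caption: the region to inpaint can be smaller than the full object mask because pixels already covered by background Gaussians are left untouched.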