Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

Yujin Ham, Junho Kim, Vivek Boominathan, Guha Balakrishnan

Abstract

Egocentric "walking tour" videos provide a rich source of image data for developing diverse visual models of environments around the world. However, the significant presence of humans in these videos, caused by crowds and eye-level camera perspectives, limits their usefulness for environment modeling. We address this challenge by developing a generative algorithm that realistically removes (i.e., inpaints) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune Casper, a state-of-the-art video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the generated clips can be used to build successful 3D/4D models of urban locations.
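As a concrete illustration of the paired-clip construction described above, the sketch below composites segmented humans onto an environment-only background to form a (composite, background) training pair. The function name, array layouts, and the use of soft per-frame mattes are assumptions for illustration, not the paper's actual pipeline; the real construction also injects simulated shadows (see Figures 3 and 4).

```python
# Minimal sketch of the paired-clip idea, assuming per-frame soft human
# mattes. `make_training_pair` and the array layouts are hypothetical.
import numpy as np

def make_training_pair(background: np.ndarray,
                       foreground: np.ndarray,
                       human_mask: np.ndarray):
    """Build a (composite, background) pair of clips.

    background: (T, H, W, 3) float32 environment-only clip in [0, 1]
    foreground: (T, H, W, 3) float32 clip containing walking humans
    human_mask: (T, H, W, 1) float32 per-frame human matte in [0, 1]
    """
    # Overlay the segmented humans on the clean background, frame by frame.
    composite = human_mask * foreground + (1.0 - human_mask) * background
    # The inpainting model is then fine-tuned to map `composite` (plus the
    # human masks) back to `background`.
    return composite, background
```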

Figures (15)

  • Figure 1: Example results of CrowdEraser, the proposed method in this study, for removing humans and their associated effects from three egocentric "walking tour" video clips. The model takes a video clip along with foreground human masks as input (top). CrowdEraser generates a new video clip with the humans and their shadows removed. CrowdEraser works well when confronted with significant human presence due to (a,c) crowds, and (b) proximity of others to the camera wearer. Comparisons with baseline methods for these scenes are provided in the supplementary material.
  • Figure 2: Locations of background video clips in EgoCrowds. Training clip locations are in green, and testing locations are in red. The full list of city names is in the supplementary material.
  • Figure 3: Overview of our data construction pipeline. Both background and foreground clips are sourced from real "walking tour" videos. The foreground clips were selected to ensure an approximately uniform distribution across different Crowd% levels. For each instance, we generate a soft shadow with randomized strength and angle by applying an affine transform to the human mask (red dots indicate pivot points); a code sketch of this step follows the figure list.
  • Figure 4: Shadow injection with varying $\alpha$ values. As $\alpha$ increases from 0.2 (a) to 0.8 (d), the cast shadow becomes stronger.
  • Figure 5: Qualitative comparison. Red boxes indicate failures to remove humans or their shadows, and yellow boxes highlight areas where the background is over-smoothed instead of being filled with plausible content. When there are fewer people and the background is relatively simple, as in (a) Jakarta, all methods perform reasonably well. However, performance degrades as masks become larger and backgrounds more complex. In particular, ProPainter [zhou2023propainter] and DiffuEraser [li2025diffueraser] struggle when the cast shadow is sharp (i.e., less diffuse). Casper [Lee_2025_CVPR] is robust at associating effects, but when the mask is large it often hallucinates objects or people inside the masked region. In contrast, our method is more robust for large masks, preserving background structure with fewer noticeable artifacts.
  • ...and 10 more figures
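The shadow-injection step from Figures 3 and 4 can be sketched as follows: the human mask is warped with an affine transform pivoted near the feet, blurred into a soft silhouette, and alpha-blended as a darkening term with strength $\alpha$. This is an illustrative reconstruction under assumed conventions (OpenCV-style arrays, our own pivot and blur choices), not the paper's exact implementation.

```python
# Sketch of soft-shadow injection, per Figures 3-4. The pivot heuristic,
# blur width, and pure-rotation warp are assumptions for illustration.
import cv2
import numpy as np

def inject_shadow(frame: np.ndarray,
                  human_mask: np.ndarray,
                  angle_deg: float,
                  alpha: float,
                  blur_ksize: int = 31) -> np.ndarray:
    """Darken `frame` with a soft shadow cast from `human_mask`.

    frame:      (H, W, 3) float32 image in [0, 1]
    human_mask: (H, W) float32 binary human mask
    angle_deg:  randomized angle controlling the shadow direction
    alpha:      shadow strength, e.g. 0.2 (faint) to 0.8 (strong) as in Fig. 4
    """
    h, w = human_mask.shape
    ys, xs = np.nonzero(human_mask > 0.5)
    if len(ys) == 0:
        return frame
    # Pivot at the mask's lowest point (the feet), so the warped shadow
    # stays attached to the person (the red dots in Figure 3).
    pivot = (float(xs[ys.argmax()]), float(ys.max()))
    # Affine transform about the pivot to "cast" the silhouette sideways.
    M = cv2.getRotationMatrix2D(pivot, angle_deg, 1.0)
    shadow = cv2.warpAffine(human_mask, M, (w, h))
    # Blur so the shadow is soft/diffuse rather than hard-edged.
    shadow = cv2.GaussianBlur(shadow, (blur_ksize, blur_ksize), 0)
    # Alpha-blend a darkening term: larger alpha -> darker shadow.
    return frame * (1.0 - alpha * shadow[..., None])
```

Sweeping `alpha` from 0.2 to 0.8 would reproduce the faint-to-strong progression shown in panels (a) through (d) of Figure 4.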