UrbanIR: Large-Scale Urban Scene Inverse Rendering from a Single Video

Chih-Hao Lin, Bohan Liu, Yi-Ting Chen, Kuan-Sheng Chen, David Forsyth, Jia-Bin Huang, Anand Bhattad, Shenlong Wang

TL;DR

UrbanIR tackles inverse rendering of unbounded urban scenes from a single video by learning a relightable neural scene model that decouples intrinsic properties from illumination. It combines a compact, hash-based scene representation with a sun–sky–ambient lighting model and a novel visibility-driven shadow loss to recover accurate shadow volumes under a single illumination. The framework supports realistic relighting, night simulation, and physically plausible object insertion, demonstrated on KITTI-360 and Waymo Open datasets with clear improvements over baselines like NeRF-OSR and Instruct-NeRF2NeRF. This work enables practical, photorealistic editing of large-scale outdoor scenes from monocular video, reducing the dependence on multi-view data or depth sensing while delivering controllable, physics-informed renderings.

Abstract

We present UrbanIR (Urban Scene Inverse Rendering), a new inverse graphics model that enables realistic, free-viewpoint renderings of scenes under various lighting conditions with a single video. It accurately infers shape, albedo, visibility, and sun and sky illumination from wide-baseline videos, such as those from car-mounted cameras, differing from NeRF's dense view settings. In this context, standard methods often yield subpar geometry and material estimates, such as inaccurate roof representations and numerous 'floaters'. UrbanIR addresses these issues with novel losses that reduce errors in inverse graphics inference and rendering artifacts. Its techniques allow for precise shadow volume estimation in the original scene. The model's outputs support controllable editing, enabling photorealistic free-viewpoint renderings of night simulations, relit scenes, and inserted objects, marking a significant improvement over existing state-of-the-art methods.

Paper Structure (30 sections, 14 equations, 15 figures, 9 tables)

This paper contains 30 sections, 14 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: We present UrbanIR (Urban Scene Inverse Rendering), a realistic and relightable neural scene model. UrbanIR infers accurate scene properties from a single video of large-scale, unbounded scenes and delivers realistic relighting, night simulation, and object insertion.
  • Figure 2: Rendering Pipeline. UrbanIR retrieves scene intrinsics (normal $N$, semantics $S$, albedo $A$) from camera rays and estimates visibility $V$ by tracing rays to the light source. The shading model computes diffuse and specular reflection and adds ambient sky light $\mathbf{L}_{\text{sky}}$ to produce the final shading map. We multiply shading and albedo, and render the sky appearance, for the final rendering. (See Eq. \ref{eq:shading} for more details.)
  • Figure 3: Intrinsic Decomposition on the Waymo Open Dataset [Sun_2020_CVPR]. We thank the FEGR authors for sharing the results of their Waymo testing sequence with us for comparison. UrbanIR not only separates albedo and shadow better but also produces smoother and more detailed albedo and normals. We recommend that readers zoom in to view the differences in the intrinsic images.
  • Figure 4: Intrinsic Decomposition Comparison. Note that NeRF-OSR [rudnev2021neural] fails to decompose the intrinsics, and RelightNet [rudnev2021neural] tends to bake shadows into the albedo.
  • Figure 5: Shadow Removal in Albedo. Our method correctly recovers the albedo in shadowed regions, while ShadowFormer [guo2023shadowformer] fails to do so.
  • ...and 10 more figures
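
The composition described in the Figure 2 caption (visibility-gated sun light plus ambient sky light, multiplied by albedo) can be illustrated with a small per-pixel sketch. This is not the paper's shading equation, which is not reproduced on this page. It is a simplified, diffuse-only approximation that omits the specular term, and the function and parameter names (`shade_pixel`, `sun_color`, `sky_color`) are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def shade_pixel(albedo, normal, sun_dir, visibility, sun_color, sky_color):
    """Assumed diffuse-only model: albedo * (V * L_sun * max(0, N.l) + L_sky).

    albedo, sun_color, sky_color: RGB tuples.
    normal, sun_dir: 3-vectors (sun_dir points from the surface toward the sun).
    visibility: 1.0 if the sun is unoccluded, 0.0 if the point is in shadow.
    """
    n = normalize(normal)
    l = normalize(sun_dir)
    cos_term = max(0.0, dot(n, l))  # Lambertian falloff, clamped at the horizon
    shading = tuple(
        visibility * sun * cos_term + sky
        for sun, sky in zip(sun_color, sky_color)
    )
    # Final color is the per-channel product of albedo and shading.
    return tuple(a * s for a, s in zip(albedo, shading))


if __name__ == "__main__":
    albedo = (0.5, 0.5, 0.5)
    up = (0.0, 0.0, 1.0)
    sun_color = (1.0, 1.0, 1.0)
    sky_color = (0.2, 0.2, 0.2)
    print("lit:     ", shade_pixel(albedo, up, up, 1.0, sun_color, sky_color))
    print("shadowed:", shade_pixel(albedo, up, up, 0.0, sun_color, sky_color))
```

Under this simplified model, relighting amounts to re-evaluating the same function with a different `sun_dir` (and a recomputed `visibility`), which is why separating albedo from shading matters: a shadow baked into the albedo would persist after the sun moves.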