UrbanIR: Large-Scale Urban Scene Inverse Rendering from a Single Video
Chih-Hao Lin, Bohan Liu, Yi-Ting Chen, Kuan-Sheng Chen, David Forsyth, Jia-Bin Huang, Anand Bhattad, Shenlong Wang
TL;DR
UrbanIR tackles inverse rendering of unbounded urban scenes from a single video by learning a relightable neural scene model that decouples intrinsic properties from illumination. It combines a compact, hash-based scene representation with a sun–sky–ambient lighting model and a novel visibility-driven shadow loss to recover accurate shadow volumes even though the scene is observed under only one illumination condition. The framework supports realistic relighting, night simulation, and physically plausible object insertion, demonstrated on KITTI-360 and the Waymo Open Dataset with clear improvements over baselines such as NeRF-OSR and Instruct-NeRF2NeRF. This enables practical, photorealistic editing of large-scale outdoor scenes from monocular video, reducing the dependence on multi-view data or depth sensing while delivering controllable, physics-informed renderings.
Abstract
We present UrbanIR (Urban Scene Inverse Rendering), a new inverse graphics model that enables realistic, free-viewpoint renderings of scenes under various lighting conditions from a single video. It accurately infers shape, albedo, visibility, and sun and sky illumination from wide-baseline videos, such as those from car-mounted cameras, a setting that differs from the dense views NeRF typically assumes. In this context, standard methods often yield subpar geometry and material estimates, such as inaccurate roof representations and numerous 'floaters'. UrbanIR addresses these issues with novel losses that reduce errors in inverse graphics inference and rendering artifacts. Its techniques allow for precise shadow volume estimation in the original scene. The model's outputs support controllable editing, enabling photorealistic free-viewpoint renderings of night simulations, relit scenes, and inserted objects, marking a significant improvement over existing state-of-the-art methods.
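To make the decomposition into albedo, visibility, and sun and sky illumination concrete, here is a minimal sketch of how such quantities could combine into a shaded color at one surface point. It assumes simple Lambertian shading with a directional sun and a constant sky term. The function name `shade`, its parameters, and the example values are hypothetical illustrations and are not taken from the UrbanIR implementation.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return (v[0] / length, v[1] / length, v[2] / length)


def shade(albedo: Vec3, normal: Vec3, sun_dir: Vec3,
          sun_color: Vec3, sky_color: Vec3, visibility: float) -> Vec3:
    """Lambertian shading under an assumed sun + sky model.

    visibility is 1.0 where the sun is unoccluded and 0.0 in full shadow.
    The sky term is treated as a constant ambient contribution (an assumption).
    """
    n = normalize(normal)
    l = normalize(sun_dir)
    # Clamp the cosine so surfaces facing away from the sun get no direct light.
    cos_term = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(
        a * (visibility * cos_term * s + k)
        for a, s, k in zip(albedo, sun_color, sky_color)
    )


albedo = (0.5, 0.5, 0.5)
normal = (0.0, 0.0, 1.0)
sun_color = (1.0, 0.9, 0.8)
sky_color = (0.2, 0.3, 0.4)

# Same point, sunlit versus shadowed: only the visibility factor changes.
lit = shade(albedo, normal, (0.0, 0.0, 2.0), sun_color, sky_color, 1.0)
shadowed = shade(albedo, normal, (0.0, 0.0, 2.0), sun_color, sky_color, 0.0)
# A crude "night" edit under this sketch: switch the sun off, keep the sky term.
night = shade(albedo, normal, (0.0, 0.0, 2.0), (0.0, 0.0, 0.0), sky_color, 1.0)
```

Under this sketch, relighting amounts to re-evaluating `shade` with a different `sun_dir` or `sun_color` while the albedo and normal stay fixed, which is why separating intrinsic properties from illumination makes such edits possible.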
