MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation
Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai
TL;DR
MV-CoLight presents a two-stage, light-aware object compositing framework that achieves illumination-consistent insertion of objects into 2D images and 3D scenes. It first learns per-view illumination cues with a Swin Transformer, then transfers these cues into a 3D Gaussian color field using Hilbert-curve ordering to enforce multi-view coherence, enabling efficient, feed-forward relighting and shadow generation. The approach is validated on public benchmarks and a new large-scale synthetic dataset (~480k scenes), demonstrating state-of-the-art performance in both single- and multi-view settings and strong generalization to real-world scenes. The work also releases a comprehensive multi-view compositing dataset and shows extensibility to other illumination priors, highlighting practical impact for AR, embodied intelligence, and robotics.
Abstract
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden the applicability of object compositing, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments show state-of-the-art harmonization results on standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and broad generalization.
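As a rough illustration of what a Hilbert curve ordering does, the sketch below implements the standard 2D Hilbert index on an n x n grid (n a power of two) and uses it to flatten grid cells into a 1D sequence in which consecutive elements are always spatial neighbors. It is not the paper's implementation: the abstract does not specify how 3D Gaussians are mapped, and the function names `xy_to_hilbert`, `hilbert_to_xy`, and `hilbert_order` are hypothetical names chosen for this example.

```python
def _rotate(s, x, y, rx, ry):
    """Rotate/flip a quadrant of side s so sub-curves join end to end."""
    if ry == 0:
        if rx == 1:
            x = s - 1 - x
            y = s - 1 - y
        x, y = y, x
    return x, y


def xy_to_hilbert(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its curve index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def hilbert_to_xy(n, d):
    """Inverse mapping: curve index d back to grid cell (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """List all grid cells in the order the curve visits them."""
    return [hilbert_to_xy(n, d) for d in range(n * n)]


if __name__ == "__main__":
    # Flatten a 4 x 4 grid into a locality-preserving 1D sequence.
    for d, cell in enumerate(hilbert_order(4)):
        print(d, cell)
```

The property that matters for sequence models is locality: elements that are adjacent in the 1D sequence are also adjacent in space, which a plain row-major scan does not guarantee at row boundaries.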
