MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai

TL;DR

MV-CoLight presents a two-stage, light-aware object compositing framework that achieves illumination-consistent insertion of objects into 2D images and 3D scenes. It first learns per-view illumination cues with a Swin Transformer, then transfers these cues into a 3D Gaussian color field using Hilbert-curve ordering to enforce multi-view coherence, enabling efficient, feed-forward relighting and shadow generation. The approach is validated on public benchmarks and a new large-scale synthetic dataset (~480k scenes), demonstrating state-of-the-art performance in both single- and multi-view settings and strong generalization to real-world scenes. The work also releases a comprehensive multi-view compositing dataset and shows extensibility to other illumination priors, highlighting practical impact for AR, embodied intelligence, and robotics.

Abstract

Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden the applicability of object compositing, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and wide generalization.

Paper Structure

This paper contains 32 sections, 7 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Illustration of our object compositing pipeline with harmonization and relighting using MV-CoLight. In (a), we show a composite scene with visually inconsistent inserted objects. Applying our MV-CoLight method in (b), we generate realistic lighting, shadows, and harmonious integration of objects into the 3D scene. Panel (c) highlights clear visual differences before and after harmonization, accompanied by consistent novel view renderings below. Explore more demos on our project page: https://city-super.github.io/mvcolight/.
  • Figure 2: Pipeline of MV-CoLight. In (a), we insert a white puppy as the composite object onto the table between basketballs, and render multi-view inharmonious images, background-only images, and depth maps using a camera trajectory moving from distant to close-up positions. Subsequently, in (b), we input single-view data into the 2D object compositing model, which processes the data through multiple Swin Transformer blocks to output the harmonized result. Finally, in (c), we project the multi-view features from the 2D model into Gaussian space via $\Phi(\cdot)$, combine them with the original inharmonious Gaussian colors projected into 2D Gaussian color space through $\Psi(\cdot)$, and then feed them into the 3D object compositing model. The model outputs harmonized Gaussian colors and computes the rendering loss by incorporating Gaussian shape attributes.
  • Figure 3: Mapping multi-view observations into a 2D Hilbert-ordered Gaussian color map. Starting from inharmonious multi-view images, depth maps, and camera poses, we compute per-view point maps and randomly sample $M$ points to initialize 3D Gaussian primitives, which we then optimize to fit the scene. Next, we construct a 3D Hilbert curve through the Gaussian centers and assign each primitive to its nearest curve point, yielding an ordered 1D sequence. Finally, we fold this sequence into a 2D grid along a 2D Hilbert curve, producing a spatially coherent projection in which each pixel encodes the color of its corresponding Gaussian.
  • Figure 4: Single-view qualitative comparison with SOTA methods [xing2024luminet, chen2025empirical, guerreiro2023pct, song2023objectstitch, zhang2023controlcom, Zeng_2024, iclight] on our proposed dataset and public datasets [zhang2023controlcom, ummenhofer2024objects], with differences highlighted via colored patches. Compared to existing baselines, our method successfully generates illumination consistent with the background and physically plausible shadows while decoupling highlights from inserted objects, demonstrating generalization capabilities on out-of-domain datasets. The method in the green box does not incorporate background images as input, whereas the others do.
  • Figure 5: Multi-view qualitative comparison with SOTA methods [xing2024luminet, chen2025empirical, guerreiro2023pct, song2023objectstitch, zhang2023controlcom, Zeng_2024, iclight, liang2024gs, chen2024gi, gu2024irgs] on our proposed dataset and real captured scenes, with differences highlighted via colored patches. Our method synthesizes plausible illumination and shadows while ensuring multi-view consistency. The method in the green box does not incorporate background images as input, whereas the others do.
  • ...and 11 more figures
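
The Hilbert-ordered mapping described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes the Gaussian centers are quantized to an integer lattice (a stand-in for "nearest curve point"), uses Skilling's transpose algorithm to compute Hilbert indices, and pads the 2D grid with zeros when the number of Gaussians is not a perfect power of four. The names `hilbert_index` and `hilbert_color_map` and the parameter `bits3d` are hypothetical.

```python
import numpy as np

def hilbert_index(coords, bits):
    """Hilbert index of an integer lattice point (Skilling's transpose algorithm)."""
    x = [int(c) for c in coords]
    n = len(x)
    top = 1 << (bits - 1)
    # Undo the per-level rotations and reflections of the curve.
    q = top
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p                      # invert low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p          # exchange low bits of x[0] and x[i]
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray encode.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    x = [v ^ t for v in x]
    # Interleave bits, most significant first, into a single integer.
    index = 0
    for bit in range(bits - 1, -1, -1):
        for v in x:
            index = (index << 1) | ((v >> bit) & 1)
    return index

def hilbert_color_map(centers, colors, bits3d=10):
    """Order Gaussians along a 3D Hilbert curve, then fold them into a 2D grid."""
    centers = np.asarray(centers, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float32)
    m, channels = colors.shape
    # Assumption: quantize centers to a (2^bits3d)^3 lattice over their bounding box.
    lo, hi = centers.min(axis=0), centers.max(axis=0)
    scale = (2**bits3d - 1) / np.maximum(hi - lo, 1e-12)
    cells = np.clip(np.floor((centers - lo) * scale), 0, 2**bits3d - 1).astype(np.int64)
    keys = np.array([hilbert_index(c, bits3d) for c in cells])
    order = np.argsort(keys, kind="stable")    # 1D sequence of Gaussian ids
    # Smallest power-of-two square grid that can hold all m Gaussians.
    k = 1
    while (1 << k) ** 2 < m:
        k += 1
    side = 1 << k
    ys, xs = np.divmod(np.arange(side * side), side)
    ranks = np.array([hilbert_index((x, y), k) for x, y in zip(xs, ys)])
    cell_of_rank = np.argsort(ranks)           # flat pixel visited d-th by the 2D curve
    grid = np.zeros((side, side, channels), dtype=np.float32)
    grid.reshape(-1, channels)[cell_of_rank[:m]] = colors[order]
    return grid, order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid, order = hilbert_color_map(rng.random((1000, 3)), rng.random((1000, 3)))
    print(grid.shape, order[:5])
```

Because both orderings follow Hilbert curves, Gaussians that are close in 3D tend to land in nearby pixels of the 2D map, which is the property a windowed image model such as a Swin Transformer can exploit. The lattice resolution and the zero padding are choices of this sketch rather than details taken from the paper.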