CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction

Jisu Shin, Richard Shaw, Seunghyun Shin, Zhensong Zhang, Hae-Gon Jeon, Eduardo Perez-Pellitero

TL;DR

CHROMA tackles photometric inconsistencies across multi-view captures, which degrade 3D reconstruction and novel view synthesis. It introduces a feed-forward transformer that predicts per-view 3D bilateral grids and confidence maps to harmonize each view's appearance with a chosen reference frame, giving cross-view consistency without scene-specific optimization. The method combines a reference-frame selection strategy, training on synthetic paired data together with a self-supervised rendering loss built on a 3D foundation model, and a multi-view bilateral-grid architecture that processes hundreds of frames in a single step. Empirically, CHROMA matches or surpasses scene-specific appearance-embedding approaches in reconstruction quality without significantly affecting the training time of the baseline 3D model, and it generalizes across scenes without retraining.

Abstract

Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis. Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but with increased computational complexity and slower training. In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and seamlessly integrates into downstream 3D reconstruction models, providing cross-scene generalization without requiring scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss leveraging 3D foundation models, improving generalization to real-world variations. Extensive experiments show that our approach outperforms or matches the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.

Paper Structure

This paper contains 10 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Input views with inconsistent appearance; (b) input views harmonized by our model; (c) novel view renderings of 3DGS fitted to the inconsistent input views and to the views corrected by our model; (d) comparison with 3DGS-based appearance embedding methods on a varying-exposure dataset.
  • Figure 2: Architecture Overview. Our model first patchifies the reference frame $\mathbf{I}_\textit{ref}$ and $N$ input multi-view source images $\{\mathbf{I}_i\}^N_{i=1}$ into tokens. These are passed through the transformer encoder blocks comprising alternating frame-wise and global self-attention layers, repeated $3$ times. The decoder uses alternating frame-attention and cross-attention with the reference frame. A final grid prediction head predicts the image and confidence bilateral grids ($\mathbf{B}_i$ and $\mathbf{C}_i$), which are subsequently sliced to produce the corrected frames $\{{\mathbf{I}'_i}\}^N_{i=1}$ and confidence maps $\{\mathbf{C}'_i\}^N_{i=1}$. The reference frame is chosen by our reference frame selection, which picks the frame with the best photometric quality; the resulting harmonized images are then used to train a wide range of 3D reconstruction models.
  • Figure 3: 3D Foundation Model based Self-Supervised Loss Pipeline.
  • Figure 4: Qualitative results grouped by dataset: DL3DV, LOM, and BilaRF.
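
The "slicing" step named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the layout common in bilateral-grid work: each grid cell stores a 3x4 affine color transform, the grid is indexed by pixel position plus a luminance guidance value, and coefficients are read out with trilinear interpolation. The function names, grid shape, and luminance weights below are all hypothetical choices made for illustration.

```python
import numpy as np


def slice_bilateral_grid(grid, image):
    """Trilinearly sample per-pixel 3x4 affine coefficients from a bilateral grid.

    grid:  (D, Gh, Gw, 3, 4) array; D is the guidance (luminance) axis (assumed layout).
    image: (H, W, 3) float array with values in [0, 1].
    Returns an (H, W, 3, 4) array of per-pixel affine coefficients.
    """
    D, Gh, Gw = grid.shape[:3]
    H, W, _ = image.shape

    # Guidance value per pixel: Rec. 601 luminance (an assumption of this sketch).
    guide = np.clip(image @ np.array([0.299, 0.587, 0.114]), 0.0, 1.0)

    # Continuous grid coordinates for every pixel.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    gx = xs / max(W - 1, 1) * (Gw - 1)
    gy = ys / max(H - 1, 1) * (Gh - 1)
    gz = guide * (D - 1)

    def corners(coord, size):
        # Lower and upper neighbor indices plus the fractional offset.
        lo = np.clip(np.floor(coord).astype(int), 0, size - 1)
        hi = np.clip(lo + 1, 0, size - 1)
        return lo, hi, coord - lo

    x0, x1, fx = corners(gx, Gw)
    y0, y1, fy = corners(gy, Gh)
    z0, z1, fz = corners(gz, D)

    # Accumulate the eight trilinear corner contributions.
    coeffs = np.zeros((H, W, 3, 4))
    for zi, wz in ((z0, 1 - fz), (z1, fz)):
        for yi, wy in ((y0, 1 - fy), (y1, fy)):
            for xi, wx in ((x0, 1 - fx), (x1, fx)):
                weight = (wz * wy * wx)[..., None, None]
                coeffs += weight * grid[zi, yi, xi]
    return coeffs


def apply_affine(coeffs, image):
    """Apply per-pixel 3x4 affine transforms to homogeneous RGB values."""
    ones = np.ones_like(image[..., :1])
    homog = np.concatenate([image, ones], axis=-1)  # (H, W, 4)
    return np.einsum("hwij,hwj->hwi", coeffs, homog)


# Usage: an identity grid leaves the image unchanged.
image = np.random.default_rng(0).random((6, 8, 3))
identity_grid = np.zeros((4, 3, 5, 3, 4))
identity_grid[..., :3] = np.eye(3)
corrected = apply_affine(slice_bilateral_grid(identity_grid, image), image)
```

Because the grid is much coarser than the image, a network only has to predict a small number of coefficients per view, while the guidance axis still lets the correction vary with pixel brightness. Under the same assumptions, a confidence grid could be sliced the same way to yield a per-pixel confidence map.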