CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction
Jisu Shin, Richard Shaw, Seunghyun Shin, Zhensong Zhang, Hae-Gon Jeon, Eduardo Perez-Pellitero
TL;DR
CHROMA tackles photometric inconsistencies across multi-view captures that hinder robust 3D reconstruction and novel view synthesis. It introduces a feed-forward transformer that predicts per-view 3D bilateral grids and confidence maps to harmonize each view's appearance with a chosen reference frame, enabling cross-view consistency without scene-specific optimization. The method combines a reference-frame selection strategy, training on synthetic paired data together with a self-supervised rendering loss built on a 3D foundation model, and a multi-view bilateral-grid architecture that processes hundreds of frames in a single step. Empirically, CHROMA matches or surpasses scene-specific appearance-embedding approaches in reconstruction quality without significantly affecting the training time of the baseline 3D model, and it generalizes across scenes for large-scale 3D pipelines.
Abstract
Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis. Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but with increased computational complexity and slower training. In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and seamlessly integrates into downstream 3D reconstruction models, providing cross-scene generalization without requiring scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss leveraging 3D foundation models, improving generalization to real-world variations. Extensive experiments show that our approach outperforms or matches the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.
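To make the idea of a spatially adaptive bilateral grid concrete, the following is a minimal NumPy sketch of how such a grid is commonly applied to one image. It is not the authors' implementation. The grid layout (one 3x4 affine color transform per cell), the luminance guidance with Rec. 601 weights, and the names `identity_grid` and `slice_and_apply` are all assumptions made for illustration. In CHROMA the grid would be predicted by the network for each view; here it is filled in by hand.

```python
import numpy as np

def identity_grid(gh, gw, gd):
    """Hypothetical helper: a grid whose every cell holds the identity 3x4 affine transform."""
    grid = np.zeros((gh, gw, gd, 3, 4), dtype=np.float64)
    grid[..., :3, :3] = np.eye(3)
    return grid

def _corners(coord, size):
    """Lower/upper integer cell indices and the fractional offset for linear interpolation."""
    lo = np.clip(np.floor(coord).astype(int), 0, size - 1)
    hi = np.minimum(lo + 1, size - 1)
    return lo, hi, coord - lo

def slice_and_apply(image, grid):
    """Trilinearly sample a per-pixel affine color transform from `grid` and apply it.

    image: (H, W, 3) float RGB in [0, 1]
    grid:  (Gh, Gw, Gd, 3, 4), indexed by (row, column, guidance intensity)
    """
    h, w, _ = image.shape
    gh, gw, gd = grid.shape[:3]

    # Guidance map: pixel luminance (assumed; other guidance choices are possible).
    guide = np.clip(image @ np.array([0.299, 0.587, 0.114]), 0.0, 1.0)

    # Continuous grid coordinates for every pixel.
    ys = np.linspace(0, gh - 1, h)[:, None] * np.ones((1, w))
    xs = np.ones((h, 1)) * np.linspace(0, gw - 1, w)[None, :]
    zs = guide * (gd - 1)

    y0, y1, fy = _corners(ys, gh)
    x0, x1, fx = _corners(xs, gw)
    z0, z1, fz = _corners(zs, gd)

    # Accumulate the eight trilinear corner contributions.
    affine = np.zeros((h, w, 3, 4))
    for yi, wy in ((y0, 1 - fy), (y1, fy)):
        for xi, wx in ((x0, 1 - fx), (x1, fx)):
            for zi, wz in ((z0, 1 - fz), (z1, fz)):
                affine += (wy * wx * wz)[..., None, None] * grid[yi, xi, zi]

    # Apply the affine transform to homogeneous RGB: out = A[:, :3] @ rgb + A[:, 3].
    rgb1 = np.concatenate([image, np.ones((h, w, 1))], axis=-1)
    return np.einsum("hwij,hwj->hwi", affine, rgb1)

# Usage: a grid that halves the gain and adds a small offset everywhere.
image = np.random.default_rng(0).random((6, 8, 3))
grid = identity_grid(4, 4, 8)
grid[..., :3, :3] *= 0.5
grid[..., :, 3] = 0.1
corrected = slice_and_apply(image, grid)
```

Because each pixel reads its transform from a cell chosen by both position and intensity, a low-resolution grid can still express corrections that vary across the image and across tonal ranges, which is what makes the representation compact enough to predict for many views at once.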
