Color Alignment in Diffusion
Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
TL;DR
The paper tackles fine-grained color-conditioned image synthesis with diffusion models by introducing color-aligned diffusion, which constrains diffusion intermediates to a conditional color space defined by a color pattern $\mathbf{c}$. Color alignment is formalized through a color-mapping function $f(\mathbf{x}_t,\mathbf{c})$ and a training objective that emphasizes color accuracy, completeness, and disentanglement. The method is implemented in both image space and latent space, and can be applied through re-training, fine-tuning, or a training-free zero-shot variant that uses a one-to-one color mapping $g$ during sampling. It is evaluated on diverse datasets (Oxford-Flower, Microsoft Emoji, and Text-to-Image-2M) against multiple baselines, achieving state-of-the-art color conditioning while preserving generation quality and diversity. The work also includes extensive ablations and analyses of blurring before latent encoding and of late-time stopping, and discusses future extensions to video and 3D, highlighting practical impact for color-controlled creative applications.
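The color-mapping function $f(\mathbf{x}_t,\mathbf{c})$ is named but not defined in this summary. As a minimal sketch, assuming the conditional color space is simply the finite set of colors appearing in $\mathbf{c}$ and that "mapping" means snapping each pixel to its nearest such color under squared RGB distance, $f$ could look like the following. The name `color_align`, the list-of-RGB-tuples representation, and the Euclidean metric are all hypothetical choices for illustration, not the authors' definition.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def color_align(pixels: Sequence[RGB], palette: Sequence[RGB]) -> List[RGB]:
    """Hypothetical f(x, c): replace every pixel with its nearest palette color.

    Assumption: the conditional color space is the finite set of colors in
    the pattern c, and closeness is squared Euclidean distance in RGB.
    """
    if not palette:
        raise ValueError("palette must contain at least one color")

    def sq_dist(a: RGB, b: RGB) -> float:
        # Squared Euclidean distance between two RGB triples.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # For each pixel, pick the palette color with the smallest distance.
    return [min(palette, key=lambda c: sq_dist(p, c)) for p in pixels]


if __name__ == "__main__":
    palette = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]  # red and blue
    image = [(0.9, 0.1, 0.2), (0.2, 0.1, 0.7), (0.6, 0.0, 0.4)]
    print(color_align(image, palette))
    # -> [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
```

Every output pixel is guaranteed to be a color from $\mathbf{c}$, which is the property the alignment is meant to enforce; the real method may use a different, possibly stochastic or learned, mapping.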
Abstract
Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels that follow a generic color pattern. Existing image synthesis methods often produce content that falls outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process of diffusion models to a given color pattern. Specifically, we project diffusion terms, either image samples or latent representations, into a conditional color space to align them with the input color distribution. This strategy simplifies prediction in diffusion models by restricting it to a color manifold while still allowing plausible structures in the generated content, thus enabling the generation of diverse content that complies with the target color pattern. Experimental results demonstrate state-of-the-art performance in conditioning and controlling color pixels, while maintaining generation quality and diversity on par with regular diffusion models.
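To make the idea of projecting diffusion terms into a conditional color space concrete, here is a toy sketch of one sampling step. It assumes the model has already produced a clean-sample estimate `x0_pred`; that estimate is snapped to its nearest colors in the pattern and then re-noised to the previous timestep using the standard forward marginal $\sqrt{\bar\alpha}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha}\,\boldsymbol{\epsilon}$. The function names, the nearest-color projection, and this particular predict-then-re-noise update are hypothetical illustrations, not the paper's actual update rule.

```python
import math
import random
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def nearest_color(p: RGB, palette: Sequence[RGB]) -> RGB:
    """Hypothetical projection: closest palette color by squared RGB distance."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, c)))


def aligned_step(
    x0_pred: Sequence[RGB],
    palette: Sequence[RGB],
    alpha_bar_prev: float,
    rng: random.Random,
) -> List[RGB]:
    """Toy color-aligned update (an assumption, not the authors' sampler).

    1. Project the model's clean-image estimate onto the palette colors.
    2. Re-noise the projected image to the previous timestep with the
       forward marginal x = sqrt(a) * x0 + sqrt(1 - a) * eps.
    """
    s = math.sqrt(alpha_bar_prev)
    n = math.sqrt(1.0 - alpha_bar_prev)
    out: List[RGB] = []
    for p in x0_pred:
        c = nearest_color(p, palette)  # color comes from the pattern c
        out.append(tuple(s * ci + n * rng.gauss(0.0, 1.0) for ci in c))
    return out


if __name__ == "__main__":
    palette = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    x0_pred = [(0.9, 0.1, 0.2), (0.2, 0.1, 0.7)]
    rng = random.Random(0)
    print(aligned_step(x0_pred, palette, alpha_bar_prev=0.5, rng=rng))  # noisy
    print(aligned_step(x0_pred, palette, alpha_bar_prev=1.0, rng=rng))  # clean
```

In this sketch the model's prediction decides which pixel receives which color (the spatial structure), while the projection restricts the colors themselves to the pattern. At the final step ($\bar\alpha = 1$) the output consists exactly of pattern colors.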
