Color Alignment in Diffusion

Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung

TL;DR

The paper tackles fine-grained color-conditioned image synthesis with diffusion models. It introduces color-aligned diffusion, which constrains diffusion intermediates to a conditional color space defined by a color pattern $\mathbf{c}$. Color alignment is formalized through a color-mapping function $f(\mathbf{x}_t,\mathbf{c})$ and a training objective that emphasizes color accuracy, completeness, and disentanglement. The method has image-space and latent-space implementations and can be applied by re-training, by fine-tuning, or as a training-free zero-shot variant that uses a one-to-one color mapping $g$ during sampling. It is evaluated on Oxford-Flower, Microsoft Emoji, and Text-to-Image-2M against multiple baselines, where it reports state-of-the-art color conditioning while preserving generation quality and diversity. Ablations analyze blurring before latent encoding and late-time stopping, and the paper discusses future extensions to video and 3D, with color-controlled creative applications as the practical target.

Abstract

Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.
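
To make the idea of "projecting into a conditional color space" concrete, the following is a minimal sketch that assumes the projection is a per-pixel nearest-color lookup against a palette extracted from the color condition. This is an illustrative assumption, not the paper's mapping $f(\mathbf{x}_t,\mathbf{c})$, whose exact form is not given in this summary; the names `align_to_palette` and `palette` are hypothetical.

```python
import numpy as np

def align_to_palette(x, palette):
    """Snap every pixel of x (H, W, 3) to its nearest color in palette (K, 3).

    Assumption: nearest-neighbor projection in RGB with squared Euclidean
    distance. The paper's color-mapping function may differ.
    """
    flat = x.reshape(-1, 1, 3)                                  # (H*W, 1, 3)
    dists = np.sum((flat - palette[None, :, :]) ** 2, axis=-1)  # (H*W, K)
    nearest = np.argmin(dists, axis=1)                          # (H*W,)
    return palette[nearest].reshape(x.shape)

# Hypothetical usage: a 1x2 image and a two-color palette.
x = np.array([[[0.1, 0.2, 0.1], [0.9, 0.8, 0.7]]])
palette = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
aligned = align_to_palette(x, palette)
```

After the projection every pixel takes a value from the palette, so the output cannot contain colors outside the condition, while which palette color lands where is still determined by the input sample.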

Paper Structure

This paper contains 24 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Examples of color-conditioned generation from different methods. Color conditions (a) can be derived from imagery (top two rows) or manual drawing (bottom two rows). Input text prompts are "painting" and "pen" (1st & 3rd rows), and "boat" (2nd & 4th rows). Our method (b) can generate pixels closely aligned with the color conditions and effectively disentangle the colors into different structures. Existing baselines [hertz2024style, ye2023ip, rombach2022high, zhang2023adding] (from top to bottom) (c) struggle with either the semantics (the first two rows) or colors (the last two rows) of generated contents.
  • Figure 2: Pipelines of the regular diffusion (a) and its color-conditioned version (b), our color-aligned diffusion (c), where noisy samples $\mathbf{x}_t$ are aligned with conditioned colors, and our zero-shot version (d), where color alignment is directly applied to predicted sample $\hat{\mathbf{x}}_0$.
  • Figure 3: Qualitative results of color-aligned image diffusion. (a)-(b) Visualization of the diffusion process by our method and DDPM [ho2020denoising]. (c) Generation results of our method and DDPM [ho2020denoising].
  • Figure 4: Qualitative results of color-aligned latent diffusion. Each input (first row) includes an in-the-wild image as color condition and a target text prompt. Each column presents results of experimented methods using the same input.
  • Figure 5: Qualitative results of sampling and editing of input color conditions. All results are generated by our fine-tuned color-aligned latent model. Red arrows represent the generation process. Blue arrows represent the editing of color conditions.
  • ...and 7 more figures