Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models

Héctor Laria, Alexandra Gomez-Villa, Jiang Qin, Muhammad Atif Butt, Bogdan Raducanu, Javier Vazquez-Corral, Joost van de Weijer, Kai Wang

TL;DR

This work tackles the challenge of precise color control in text-to-image diffusion by proposing ColorWave, a training-free approach that exploits semantic attribute binding in IP-Adapter to map RGB values to linguistic color descriptors. It introduces automatic color-name generation and a spatial prior to selectively modulate attention, enabling exact RGB-level control without fine-tuning. Empirical results show ColorWave outperforms training-free baselines on color accuracy and realism, and approaches the performance of color-specific methods like ColorPeel while offering immediate, arbitrary color specification. The method demonstrates robust color control across diverse objects and contexts, signaling a new, practical paradigm for color-consistent diffusion-based synthesis.

Abstract

Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.

Paper Structure

This paper contains 18 sections, 4 equations, 21 figures, 2 tables.

Figures (21)

  • Figure 1: ColorWave accurately reproduces subtle color variations in smooth interpolation between similar tones. Each column shows a different target object rendered with gradually shifting colors (displayed above each image). The results demonstrate that ColorWave is sensitive to small changes in the RGB color space while preserving realistic object appearance and scene composition.
  • Figure 2: Illustration of semantic attribute binding. The color-guidance image is shown on top. a) Given the color name used in the prompt, the generated result picks the corresponding color from the color-guidance image. b) Changing the colors of the color-guidance image produces corresponding changes in the generated images. c) The exact color to be generated for a given color name can also be supplied as a synthetic example.
  • Figure 3: Semantic attribute binding, visualized as a similarity matrix between color names in text prompts and RGB color values. The heatmap shows the normalized dot product similarity between key projections of color word tokens and image features.
  • Figure 4: Overview of ColorWave. Our approach leverages semantic attribute binding between IP-Adapter and text cross-attention pathways to achieve precise color control. User-specified RGB values are encoded through IP-Adapter and selectively bound to object tokens in the text prompt, enabling training-free color attribution while preserving generative quality.
  • Figure 5: Limitations when directly exploiting semantic attribute binding for color control. (a) Reference image with oval. (b,c) Shape and size variations alter the resulting bird coloration despite using the same green color. (d) With multiple green regions, attribution becomes ambiguous and inconsistent. (e) Using synthetic color references produces flat, unrealistic textures.
  • ...and 16 more figures
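
The similarity matrix described in the Figure 3 caption can be illustrated with a small sketch. It reads "normalized dot product" as cosine similarity, which is an assumption, and it uses hand-made three-dimensional vectors in place of real key projections. The function names and toy values are hypothetical and are not taken from the paper or from IP-Adapter.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(text_keys, image_keys):
    """Cosine similarity between every text key (rows) and image key (columns)."""
    text_unit = [l2_normalize(k) for k in text_keys]
    image_unit = [l2_normalize(k) for k in image_keys]
    return [
        [sum(a * b for a, b in zip(t, i)) for i in image_unit]
        for t in text_unit
    ]


# Toy stand-ins for key projections (assumed values, not model outputs).
color_words = ["red", "green", "blue"]
text_keys = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
reference_colors = ["(255, 0, 0)", "(0, 255, 0)", "(0, 0, 255)"]
image_keys = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]]

sim = similarity_matrix(text_keys, image_keys)

# For each color word, report the reference color with the highest similarity.
best_match = [max(range(len(row)), key=row.__getitem__) for row in sim]
for word, idx in zip(color_words, best_match):
    print(f"{word!r} binds most strongly to RGB {reference_colors[idx]}")
```

In this toy setup each color word lines up with its own reference color, which is the diagonal pattern a heatmap of this kind would show if the binding holds. With a real model, the vectors would instead come from the key projections of the text and image cross-attention pathways.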