Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum

TL;DR

ColorCtrl tackles training-free text-guided color editing with physical-consistency constraints by leveraging Multi-Modal Diffusion Transformer (MM-DiT) attention. It disentangles structure and color through three components: structure preservation, color preservation, and optional attribute re-weighting, enabling edits that affect albedo, light source color, and ambient illumination while keeping geometry, material properties, and light-matter interactions intact. The method demonstrates state-of-the-art performance among training-free approaches on SD3 and FLUX.1-dev, surpasses strong commercial baselines in consistency, and extends effectively to video diffusion models like CogVideoX and to instruction-based editing models, underscoring broad applicability and practicality. ColorCtrl’s model-agnostic design, reliance on attention-map manipulation, and robust results on both real and synthetic images position it as a scalable solution for high-fidelity, controllable color editing in both research and real-world deployment.

Abstract

Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
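To make the idea of manipulating attention maps in a joint text-and-vision attention layer more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head over a concatenated `[text; vision]` token sequence, copies the vision-to-vision block of a source attention map into the target branch (one plausible reading of structure preservation), and scales the attention that vision tokens pay to one text token (one plausible reading of word-level re-weighting). The function name `joint_attention`, the parameters `src_attn` and `token_scale`, the choice to scale after the softmax, and the row renormalization are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q, k, v, n_text, src_attn=None, token_scale=None):
    """Single-head attention over [text; vision] tokens (hypothetical helper).

    q, k, v     : arrays of shape (n_text + n_vision, d)
    src_attn    : optional attention map from a source branch; its
                  vision-to-vision block overwrites the target's block
    token_scale : optional {text_token_index: factor}; scales how strongly
                  vision tokens attend to that text token
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if src_attn is not None:
        # Assumed structure-preservation step: reuse source vision-to-vision weights.
        attn[n_text:, n_text:] = src_attn[n_text:, n_text:]
    if token_scale:
        # Assumed re-weighting step: scale the vision-to-text column of a word.
        for idx, factor in token_scale.items():
            attn[n_text:, idx] *= factor
    # Renormalize so every row is a valid distribution again (a design choice
    # of this sketch, not something stated in the source).
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Toy usage with random tensors standing in for real model activations.
rng = np.random.default_rng(0)
n_text, n_vis, d = 4, 6, 8
n = n_text + n_vis
q_src, k_src, v_src = (rng.normal(size=(n, d)) for _ in range(3))
q_tgt, k_tgt, v_tgt = (rng.normal(size=(n, d)) for _ in range(3))

_, attn_src = joint_attention(q_src, k_src, v_src, n_text)
_, attn_plain = joint_attention(q_tgt, k_tgt, v_tgt, n_text)
_, attn_struct = joint_attention(q_tgt, k_tgt, v_tgt, n_text, src_attn=attn_src)
out_edit, attn_edit = joint_attention(
    q_tgt, k_tgt, v_tgt, n_text, src_attn=attn_src, token_scale={2: 2.0}
)
```

In this toy setup the text rows of the attention map are left untouched, the vision-to-vision block of the edited map is proportional (row by row) to the source block, and a factor above 1 increases the share of attention that vision tokens give to the chosen word. A real MM-DiT has many heads and layers, and which layers and timesteps to modify would have to follow the paper itself.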

Paper Structure

This paper contains 50 sections, 2 equations, 23 figures, 12 tables.

Figures (23)

  • Figure 1: Text-conditioned color editing. Our method, ColorCtrl with FLUX.1-dev, edits colors across multiple materials while preserving light-matter interactions. For example, in the fourth case, the ball's color, its water reflection, specular highlights, and even small droplets on the glass have all been changed. It also enables fine-grained control over the intensity of specific descriptive terms.
  • Figure 2: Pipeline of ColorCtrl. (a) Visualizes the attention mechanism in MM-DiT blocks. (b) Enables color editing while maintaining structural consistency. (c) Preserves colors in non-editing regions. (d) Applies attribute re-weighting to specific tokens. Symbols in the source branch have no superscript. Symbols with a superscript $^*$ indicate the target, and hats (e.g., $\hat{V}$, $\hat{M}$) denote outputs.
  • Figure 3: Top row: SD3 results; bottom row: FLUX.1-dev results. (a) The edit prompt is "white fox" $\to$ "orange fox". Left to right: source image, our full method, without color preservation, with swapped text-to-text part in structure preservation, and with swapped $V^{\text{text}}$ in color preservation. (b) The generation prompt is "a white fox in a forest", and the token for mask extraction is "fox". From left to right: the mask extracted from vision-to-text parts, and from text-to-vision parts.
  • Figure 4: Qualitative image results compared with training-free methods and commercial models on PIE-Bench. The top three rows are generated using FLUX.1-dev, while the bottom two are generated using SD3. Best viewed with zoom-in.
  • Figure 5: Examples of attribute re-weighting. The top two rows are generated using FLUX.1-dev, while the bottom one is generated using SD3.
  • ...and 18 more figures