Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin; Xili Dai; Ling-Hao Chen; Deyu Zhou; Jianan Wang; Duomin Wang; Gang Yu; Lionel M. Ni; Lei Zhang; Heung-Yeung Shum

TL;DR

ColorCtrl is a training-free method for text-guided color editing that keeps edits physically consistent by leveraging the attention mechanisms of Multi-Modal Diffusion Transformers (MM-DiT). It disentangles structure and color through structure preservation, color preservation, and optional attribute re-weighting, enabling edits to albedo, light source color, and ambient illumination while keeping geometry, material properties, and light-matter interactions intact. On SD3 and FLUX.1-dev, the method achieves state-of-the-art results among training-free approaches and surpasses strong commercial baselines in consistency. It also extends to video diffusion models such as CogVideoX and to instruction-based editing models. Because it is model-agnostic and relies only on attention-map manipulation, ColorCtrl is a practical option for high-fidelity, controllable color editing of both real and synthetic images.

Abstract

Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models such as CogVideoX, our approach shows even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
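
To make "word-level control of attribute intensity" more concrete, here is a minimal sketch of one generic way to re-weight attention toward a single text token. It is not the authors' implementation: the function name `reweight_token`, the choice to scale post-softmax weights, and the row renormalization are all assumptions made for illustration, and the toy matrix stands in for real MM-DiT attention logits.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_token(logits, token_index, scale):
    """Scale the attention every query pays to one key token.

    Assumption (not from the paper): the scaling is applied to the
    post-softmax weights and each row is renormalized to sum to 1.
    """
    out = []
    for row in logits:
        probs = softmax(row)
        probs[token_index] *= scale
        total = sum(probs)
        out.append([p / total for p in probs])
    return out


# Toy example: two image-token queries attending to three keys.
# Key 1 stands in for the token of an attribute word in the prompt.
logits = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
boosted = reweight_token(logits, token_index=1, scale=2.0)
```

With `scale` above 1 the chosen word draws a larger share of each query's attention, and with `scale` below 1 a smaller share; `scale=1.0` leaves the attention weights unchanged.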
