Control Color: Multimodal Diffusion-based Interactive Image Colorization

Zhexin Liang, Zhaochen Li, Shangchen Zhou, Chongyi Li, Chen Change Loy

TL;DR

CtrlColor presents a unified, multimodal diffusion-based colorization framework built on a pre-trained latent diffusion model to support unconditional, prompt-based, stroke-based, and exemplar-based colorization within a single pipeline. It introduces two artifact-mitigation modules—a content-guided deformable autoencoder and streamlined self-attention guidance—to curb color overflow and incorrect coloring while preserving content fidelity. By encoding stroke positions/colors into the diffusion process and leveraging CLIP-based conditioning, CtrlColor achieves richer, more diverse colorization with precise local control and strong qualitative/quantitative performance. The work also demonstrates versatile applications, including interactive interfaces, recolorization, regional edits, iterative editing, and potential video colorization, highlighting practical impact for interactive image editing and restoration.

Abstract

Despite the existence of numerous colorization methods, several limitations remain, such as lack of user interaction, inflexibility in local colorization, unnatural color rendering, insufficient color variation, and color overflow. To solve these issues, we introduce Control Color (CtrlColor), a multi-modal colorization method that leverages the pre-trained Stable Diffusion (SD) model, offering promising capabilities in highly controllable interactive image colorization. While several diffusion-based methods have been proposed, supporting colorization in multiple modalities remains non-trivial. In this study, we aim to tackle both unconditional and conditional image colorization (text prompts, strokes, exemplars) and address color overflow and incorrect color within a unified framework. Specifically, we present an effective way to encode user strokes to enable precise local color manipulation and employ a practical way to constrain the color distribution to be similar to that of the exemplars. Apart from accepting text prompts as conditions, these designs add versatility to our approach. We also introduce a novel module based on self-attention and a content-guided deformable autoencoder to address the long-standing issues of color overflow and inaccurate coloring. Extensive comparisons show that our model outperforms state-of-the-art image colorization methods both qualitatively and quantitatively.

Paper Structure

This paper contains 28 sections, 11 equations, 29 figures, 5 tables.

Figures (29)

  • Figure 1: The proposed CtrlColor achieves highly controllable image colorization, offering users a simple and intuitive means to colorize images according to their specific preferences. Our method supports both unconditional and conditional colorization, with conditions such as text, strokes, and exemplar images, and allows any combination of them. Using strokes as masks, our method makes selective image editing effortless. Our method also supports highly flexible iterative image editing, empowering users to fine-tune specific details during the colorization process.
  • Figure 2: Visual comparisons on stroke-based colorization.
  • Figure 3: Left: The main structure of our CtrlColor Model achieves multi-modal controllable colorization by blending controls from diverse components. Right: To manage large color overflow and inaccurate color regions, we integrate content-guided deformable convolution layers into the autoencoder's decoder. These layers restrict deformed color regions to align with nearby colors. Additionally, refined self-attention guidance is employed during inference to blur small overflow areas by referencing the surrounding color distribution. This process aims to smooth the color distribution, effectively addressing the issue of small color overflow.
  • Figure 4: Qualitative comparison for unconditional image colorization. The first row of images is from the COCO-Stuff dataset, and the second row is from the ImageNet validation dataset. Our method generates more vivid and realistic colors with less color bleeding. (Zoom in for best view.) More comparisons are provided in the supplementary material.
  • Figure 5: Qualitative comparisons on stroke-based colorization.
  • ...and 24 more figures