ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

Dingkun Yan, Liang Yuan, Erwin Wu, Yuma Nishioka, Issei Fujishiro, Suguru Saito

TL;DR

The authors tackle the distribution problem in reference-based sketch colorization with latent diffusion models that use CLIP image tokens as conditions, enabling zero-shot sequential text manipulation. They propose CLS and Attention variants, plus training strategies (dropping, noisy, and dual-conditioned losses) to align $p_{\theta}(z|s,r)$ with $p(z|s)$ and reduce deterioration. Their global and local text-based manipulation methods leverage CLIP embeddings to fluently steer colorization, demonstrated through ablations, baselines, and a user study. The results show improved sketch fidelity and controllability, with practical implications for anime-style colorization workflows, though local manipulation remains challenging and interface usability can be enhanced.

Abstract

Diffusion models have recently demonstrated their effectiveness in generating extremely high-quality images and are now utilized in a wide range of applications, including automatic sketch colorization. Although many methods have been developed for guided sketch colorization, there has been limited exploration of the potential conflicts between image prompts and sketch inputs, which can lead to severe deterioration in the results. Therefore, this paper exhaustively investigates reference-based sketch colorization models that aim to colorize sketch images using reference color images. We specifically investigate two critical aspects of reference-based diffusion models: the "distribution problem", which is a major shortcoming compared to text-based counterparts, and the capability in zero-shot sequential text-based manipulation. We introduce two variations of an image-guided latent diffusion model utilizing different image tokens from the pre-trained CLIP image encoder and propose corresponding manipulation methods to adjust their results sequentially using weighted text inputs. We conduct comprehensive evaluations of our models through qualitative and quantitative experiments as well as a user study.

Paper Structure

This paper contains 17 sections, 13 equations, 20 figures, 2 tables, and 2 algorithms.

Figures (20)

  • Figure 1: Illustration of distribution problem in T2I colorization. The network prioritizes prompt conditions over the sketch in the arm regions. This preference results in unexpected colorization discrepancies, particularly in areas anticipated to be skin-toned, thereby leading to visually discordant segmentation. Presented results are derived from the ControlNet_lineart_anime + Anything v3 framework.
  • Figure 2: Illustration of deterioration caused by the distribution problem: (1) quality of textures, (2) erroneously rendered objects, and (3) segmentation error. Shuffle-0drop is one of our ablation models.
  • Figure 3: Illustration of the distribution problem. Most parts of the optimized distribution $p_{\theta}(z|s,r)$ after training lie outside of $p(z|s)$.
  • Figure 4: Training pipelines of the proposed Attention models. We introduce two training strategies for the Attention model, namely, deformation and shuffle training. Deformed images and sketch images are generated before training begins. Noisy training performs diffusion on the local tokens and is combined with either shuffle training or deformation training.
  • Figure 5: Training pipelines of the CLS model.
  • ...and 15 more figures