Image Referenced Sketch Colorization Based on Animation Creation Workflow

Dingkun Yan, Xinrui Wang, Zhuoru Li, Suguru Saito, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo

TL;DR

This work tackles automatic sketch colorization in animation pipelines by combining a diffusion-based model with spatially aware color guidance. It introduces a split cross-attention mechanism and switchable LoRA modules to separately colorize foreground and background, guided by a sketch and a reference image, thereby eliminating spatial artifacts. The approach mirrors real-world production steps, uses high-dimensional local reference tokens, and adds a recovery transformer to fuse foreground and background information. Empirical results show improved artifact suppression, color fidelity, and user preference over baselines across qualitative, quantitative, and human studies, with potential impact on speeding up animation colorization workflows. Limitations include dependency on accurate masks, with future work extending to video colorization.

Abstract

Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still face problems: text-guided methods fail to provide accurate color and style reference, hint-guided methods still involve manual operation, and image-referenced methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach uses the sketch as spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with spatial masks. In particular, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately on foreground and background regions to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating the spatial artifacts. For inference, we design switchable modes for diverse use scenarios by changing which modules are activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality, artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component. Code is available at https://github.com/tellurion-kanata/colorizeDiffusion.
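To make the split cross-attention idea more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that queries come from the sketch-side features, that shared key/value projections receive separate LoRA updates for foreground and background, and that the two attention outputs are composed with a per-position foreground mask. All names (`project`, `attend`, `split_cross_attention`), the toy shapes, and the use of two attention calls instead of a single merged forward pass are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(ctx, W, lora=None):
    # Shared projection W plus an optional low-rank update A @ B.
    out = ctx @ W
    if lora is not None:
        A, B = lora
        out = out + (ctx @ A) @ B
    return out

def attend(q, ctx, Wk, Wv, lora_k=None, lora_v=None):
    # Standard scaled dot-product cross-attention over reference tokens.
    k = project(ctx, Wk, lora_k)
    v = project(ctx, Wv, lora_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def split_cross_attention(q, fg_ctx, bg_ctx, fg_mask, Wk, Wv, fg_lora, bg_lora):
    # q: (N, d) queries; fg_mask: (N,) bool, True marks a foreground position.
    out_fg = attend(q, fg_ctx, Wk, Wv, fg_lora["k"], fg_lora["v"])
    out_bg = attend(q, bg_ctx, Wk, Wv, bg_lora["k"], bg_lora["v"])
    # Piecewise composition: each position reads only from its own region.
    return np.where(fg_mask[:, None], out_fg, out_bg)

# Toy setup (all sizes are hypothetical).
rng = np.random.default_rng(0)
N, d, c = 6, 8, 8
q = rng.normal(size=(N, d))
fg_ctx = rng.normal(size=(4, c))   # foreground reference tokens
bg_ctx = rng.normal(size=(5, c))   # background reference tokens
Wk = rng.normal(size=(c, d))
Wv = rng.normal(size=(c, d))

def make_lora(rank):
    return {
        "k": (rng.normal(size=(c, rank)), rng.normal(size=(rank, d))),
        "v": (rng.normal(size=(c, rank)), rng.normal(size=(rank, d))),
    }

fg_lora = make_lora(2)
bg_lora = make_lora(4)
fg_mask = np.array([True, True, False, False, True, False])

out = split_cross_attention(q, fg_ctx, bg_ctx, fg_mask, Wk, Wv, fg_lora, bg_lora)
print(out.shape)
```

In this sketch, changing the background tokens leaves the foreground positions of `out` untouched, which is the independence property the abstract attributes to the split design.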

Paper Structure

This paper contains 15 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Given reference images, our proposed method automatically synthesizes high-quality sketch colorization results that faithfully match the reference color distribution and are free from artifacts.
  • Figure 2: Illustration of spatial entanglement. We use red rectangles to highlight the spatially entangled artifacts in the result of the IP-Adapter baseline, where additional arms appear unexpectedly and the model mistakenly synthesizes long hair.
  • Figure 3: Illustration of the colorization workflow in professional animation studios. A: character designers design characters as references. B: senior animators draw the sketches for the key frames. C: animators colorize the figures in the sketches according to the character designs. D: animators colorize the background of the sketches and merge foreground and background into finished frames.
  • Figure 4: Illustration of the proposed framework. We use reference masks to separate reference images into foreground and background, and the CLIP image encoder $\phi$ to extract embeddings from both regions. The background embeddings first go through the recovery transformer $\varphi$ to recover detailed information and are then concatenated with the foreground embeddings as the final K and V inputs for split cross-attention. Similar to Eq. \ref{split-attention}, the compose operation is a spatial piecewise function employed to separate foreground and background.
  • Figure 5: Based on the LoRA weights, the proposed method can merge the foreground and background features in one forward pass and switch between three inference modes. We denote the dimension of the pre-trained weights as CH. The rank of the foreground LoRA is fixed at 16, while the rank of the background LoRA is 0.5*CH.
  • ...and 5 more figures
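
The LoRA ranks given in the Figure 5 caption can be put in perspective with a short parameter count. This is a back-of-the-envelope sketch: it assumes a square CH×CH pre-trained projection and a hypothetical CH = 1024, neither of which is stated in the source, and it uses the standard LoRA count of rank × (d_in + d_out) parameters per adapted matrix.

```python
def lora_params(d_in, d_out, rank):
    # A LoRA update factorizes into (d_in x rank) and (rank x d_out) matrices.
    return rank * (d_in + d_out)

CH = 1024            # hypothetical channel dimension, not from the source
fg_rank = 16         # foreground rank, as stated in the caption
bg_rank = CH // 2    # background rank 0.5*CH, as stated in the caption

full = CH * CH                          # assumed square pre-trained weight
fg = lora_params(CH, CH, fg_rank)
bg = lora_params(CH, CH, bg_rank)

print(f"full weight:     {full}")
print(f"foreground LoRA: {fg} ({fg / full:.2%} of full)")
print(f"background LoRA: {bg} ({bg / full:.2%} of full)")
```

Under these assumptions the foreground adapter is a small fraction of the base weight, while the background adapter carries as many parameters as the full matrix even though its update is still limited to rank CH/2.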