Towards High-resolution and Disentangled Reference-based Sketch Colorization

Dingkun Yan; Xinrui Wang; Ru Wang; Zhuoru Li; Jinze Yu; Yusuke Iwasawa; Yutaka Matsuo; Jiaxian Guo

TL;DR

This paper proposes a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively, thereby achieving superior quality, resolution, and controllability of colorization.

Abstract

Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributes from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing state-of-the-art performance across both quality and controllability metrics. An ablation study reveals the influence of each component.
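
To make the idea of a Gram loss across two branches more concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the Gram matrix is a token-to-token cosine-similarity matrix over flattened feature maps and that the loss is the mean squared difference between the Gram matrices of the two branches. The function names `gram_matrix` and `gram_regularization_loss`, the cosine normalization, and the choice of mean squared error are all hypothetical; the paper's own equations may differ.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Token-to-token cosine-similarity Gram matrix.

    features: array of shape (num_tokens, channels), i.e. a feature map
    flattened over its spatial dimensions (an assumption of this sketch).
    Returns an array of shape (num_tokens, num_tokens).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)  # guard against zero vectors
    return unit @ unit.T


def gram_regularization_loss(aligned: np.ndarray, misaligned: np.ndarray) -> float:
    """Mean squared difference between the two branches' Gram matrices.

    Assumption: both branches yield feature maps of identical shape, and
    matching their token-similarity structure is the regularization goal.
    """
    diff = gram_matrix(aligned) - gram_matrix(misaligned)
    return float(np.mean(diff ** 2))


# Toy features standing in for the two branches (hypothetical data).
rng = np.random.default_rng(0)
aligned_feats = rng.normal(size=(16, 8))
misaligned_feats = aligned_feats + 0.5 * rng.normal(size=(16, 8))

loss_same = gram_regularization_loss(aligned_feats, aligned_feats)
loss_diff = gram_regularization_loss(aligned_feats, misaligned_feats)
```

In this sketch the loss is zero when both branches produce features with the same pairwise similarity structure and grows as the misaligned branch's spatial structure drifts, which is the kind of drift the abstract attributes to the distribution shift.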

Paper Structure (16 sections, 2 equations, 10 figures, 1 table)

This paper contains 16 sections, 2 equations, 10 figures, 1 table.

Introduction
Related work
Latent Diffusion Models
Image Referenced Diffusion Models
Sketch Colorization
Method
Distribution Shift and Spatial Entanglement
Optimize the Distribution Shift with Gram Loss
Precise Attribution Control by WD-Tagger
Feature-level Plugin
Experiment
Implementation details
Ablation study
Comparison with baselines
Cross-content validation
...and 1 more sections

Figures (10)

Figure 1: Left: Compared with the latest image-guided sketch colorization methods [liu2025manganinja, yan2025image], the proposed method synthesizes higher-resolution colorized results with accurate colors and vivid textures for inputs of various styles and contents. Right: The proposed dual-branch architecture and Gram regularization loss effectively eliminate the side effects of distribution shift.
Figure 2: The model incorrectly learns spatial semantics from reference images, which contradict the spatial semantics from sketch images and cause spatial entanglement.
Figure 3: As training progresses, the model increasingly transfers spatial semantics from the reference images into the colorized results, leading to deviations from the correct sketch-based segmentation. The ground-truth Gram matrix is obtained by discarding the reference inputs during inference, and query tokens in the Gram matrices are highlighted with red points.
Figure 4: The left panel illustrates the architecture of the proposed framework, while the right panel shows the computation of the Gram loss. In the first stage, the backbone is trained for reference-based colorization using image embeddings, where the embedding inputs to the denoising U-Net are extracted from the entire reference image (indicated by the red arrow). The Gram loss is activated only during this first training stage. In the subsequent stages, we introduce feature-level representations for foreground and background regions through their respective plugin adapters. During inference, the plugin adapters are executed only once at timestep $t=0$.
Figure 5: Ablation results of WD-tagger. The model without the WD tagger fails to correctly colorize the eyes when reference eyes are small and color mismatched. It also shows weaker segmentation guidance overall. Both ablation variants exhibit artifacts without Gram regularization in (e). FID scores are shown on the left.
...and 5 more figures
