
InstanceAnimator: Multi-Instance Sketch Video Colorization

Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin, Qifeng Chen, Anyi Rao, Zeyu Wang

Abstract

We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

Paper Structure

This paper contains 24 sections, 8 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Example results of InstanceAnimator. Given diverse instances, sketch sequences, and textual descriptions, our framework enables high-quality, controllable video colorization with multi-instance and background customization.
  • Figure 2: Motivation. Unlike traditional methods that require multi-stage, time-consuming, and first-frame-dependent colorization, InstanceAnimator directly colorizes sketch sequences into videos after background and character design. Our method no longer relies on a single reference keyframe but performs end-to-end instance-aware colorization, providing decoupled control that achieves higher user flexibility and significantly reduces time consumption and labor.
  • Figure 3: Overview of InstanceAnimator. We first apply instance-aware attention with instance latent features and noise features to establish a correspondence between the line drawing and the reference instances, as well as to maintain the character feature. Concurrently, instances, background, and text descriptions are fed into the Adaptive Decoupled Control Module independently, which dynamically injects condition information into DiT blocks through three condition-specific expert modules. At the inference stage, users can adjust the conditional weights and freely change the reference instances and image background to enhance controllability and creative flexibility.
  • Figure 4: (Left): Condition Fusion. The sketch, canvas, and background conditions are temporally aligned and concatenated along the channel dimension. (Right): Instance-Aware Attention Mask. This design enables the model to capture the correspondence between reference instances and sketches while avoiding a sharp increase in computational complexity.
  • Figure 5: Instance Control Ability. (Up) Given the same sketch and different reference instances, InstanceAnimator generates a variety of colorful videos. (Down) Using the same designed characters, our framework colorizes different sketches with consistent colors and user-customized backgrounds.
  • ...and 8 more figures
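
The condition fusion described in Figure 4 (sketch, canvas, and background streams temporally aligned and concatenated along the channel dimension) can be sketched minimally as below. This is an illustrative reconstruction, not the authors' implementation: the tensor shapes, channel counts, and variable names are assumptions, chosen only to show what channel-wise concatenation of temporally aligned condition streams looks like.

```python
import numpy as np

# Assumed layout: each condition stream is a (T, C, H, W) tensor
# already rasterized to the same T frames and H x W grid.
T, H, W = 8, 64, 64
sketch = np.random.rand(T, 1, H, W)      # line-art frames (1 channel, assumed)
canvas = np.random.rand(T, 3, H, W)      # canvas guidance with placed instances (RGB, assumed)
background = np.random.rand(T, 3, H, W)  # background reference (RGB, assumed)

# With all streams sharing the temporal axis, fusion reduces to a
# concatenation along the channel dimension (axis=1).
fused = np.concatenate([sketch, canvas, background], axis=1)
print(fused.shape)  # (8, 7, 64, 64)
```

The fused tensor would then be fed to the diffusion backbone as a single conditioning input; keeping the streams as separate channels (rather than summing them) preserves each condition's information for the downstream blocks to weight independently.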