Uni-Animator: Towards Unified Visual Colorization

Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng

TL;DR

Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

Abstract

We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
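
The abstract names sketch-based dynamic RoPE encoding but does not give its formulation. As a rough illustration only, the sketch below implements standard rotary position embedding (RoPE) in NumPy and pairs it with a hypothetical `motion_aware_positions` helper that spaces temporal positions further apart where consecutive sketch frames differ more. The helper, its scaling rule, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: one rotation frequency per pair of channels."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # shape (dim/2,)
    return np.outer(positions, inv_freq)              # shape (n, dim/2)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive channel pairs of x (n, dim) by position-dependent angles."""
    x = np.asarray(x, dtype=float)
    n, dim = x.shape
    theta = rope_angles(positions, dim, base)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def motion_aware_positions(sketch_frames):
    """HYPOTHETICAL (not from the paper): temporal positions that advance
    faster between frames whose sketches differ more.

    sketch_frames: array of shape (T, H, W) with values in [0, 1].
    """
    # Mean absolute change between consecutive sketches, shape (T-1,)
    diffs = np.abs(np.diff(sketch_frames, axis=0)).mean(axis=(1, 2))
    # Every step is at least 1; larger relative motion means a larger step.
    steps = 1.0 + diffs / (diffs.mean() + 1e-8)
    return np.concatenate([[0.0], np.cumsum(steps)])

# Usage: 5 sketch frames, one 8-channel token per frame.
rng = np.random.default_rng(0)
frames = rng.random(size=(5, 16, 16))
tokens = rng.normal(size=(5, 8))
rotated = apply_rope(tokens, motion_aware_positions(frames))
```

Because RoPE only changes how positions enter the attention computation, adjusting the positions at inference time adds no learned parameters, which is consistent with the "without extra training" remark in the Figure 4 caption. How the paper actually derives positions from the sketch is not stated in this section.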

Paper Structure (18 sections, 10 equations, 8 figures, 2 tables, 1 algorithm)

This paper contains 18 sections, 10 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Image and Video Sketch Colorization Results. We propose a unified framework for both image and video sketch colorization. Leveraging single or multiple visual references, our method generates results with consistent visual appearance and physical details that align with the input sketch sequence. Our proposed Sketch-based Dynamic RoPE adaptively models motion-aware spatial-temporal dependencies, effectively increasing the consistency of movement in dynamic video scenes.
  • Figure 2: Workflow Comparison: Manual Colorization vs. Ours. Our automated framework eliminates repetitive frame-by-frame adjustments and cross-domain adaptation efforts, significantly reducing labor costs in industrial production.
  • Figure 3: Motivation: Limitations of Existing Methods vs. Our Solution. Existing methods suffer from style deviation (left), detail degradation (right), and temporal flickering (bottom) when handling images or videos separately. Our unified framework addresses all three issues concurrently.
  • Figure 4: Overall architecture of Uni-Animator. Given input visual references, the VAE encoder and physical projector encode their physical features and concatenate them with the visual latent features; the DiT model then predicts subsequent frames conditioned on physical, visual, and text embeddings. Sketch-based Dynamic RoPE reduces temporal inconsistency in large-motion dynamic scenes without extra training.
  • Figure 5: Visual Control Ability. Our method supports accurate semantic matching for image colorization with over five visual references. For video colorization, it enables dynamic customization of characters by altering visual references while preserving original motions.
  • ...and 3 more figures