OmniColor: A Unified Framework for Multi-modal Lineart Colorization

Xulu Zhang, Haoqian Du, Xiaoyong Wei, Qing Li

Abstract

Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be released at https://github.com/zhangxulu1996/OmniColor.
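To make the gating idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of an adaptive spatial-semantic gating module. The class name, the small MLP gate, and the convex-combination fusion are illustrative assumptions, not the paper's actual AS-Gate implementation.

```python
# Minimal sketch of an adaptive spatial-semantic gating module (hypothetical;
# the paper's actual AS-Gate design may differ). The gate predicts per-token
# weights that balance spatially-aligned features against semantic-reference
# features before they are injected into the diffusion backbone.
import torch
import torch.nn as nn


class AdaptiveSpatialSemanticGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Small MLP mapping the concatenated condition features to a gate in [0, 1].
        self.gate_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.SiLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, spat_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # spat_feat, sem_feat: (batch, tokens, dim)
        gate = self.gate_mlp(torch.cat([spat_feat, sem_feat], dim=-1))  # (B, N, 1)
        # Convex combination: gate -> 1 favours spatial control, gate -> 0 favours semantics.
        return gate * spat_feat + (1.0 - gate) * sem_feat


if __name__ == "__main__":
    fuse = AdaptiveSpatialSemanticGate(dim=64)
    spat = torch.randn(2, 256, 64)   # e.g. encoded color hints aligned to the lineart
    sem = torch.randn(2, 256, 64)    # e.g. VLM tokens from an identity reference
    print(fuse(spat, sem).shape)     # torch.Size([2, 256, 64])
```

A per-token gate of this kind lets spatially precise hints dominate where they exist while deferring to semantic references elsewhere, which is one plausible way to "dynamically balance multi-modal constraints" as the abstract describes.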

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Demonstration of OmniColor's unified colorization capabilities. From a single lineart (L), our model can generate diverse, high-quality colorized results by integrating different combinations of text (T), color hints (C), identity reference (I), and temporal history (H).
  • Figure 2: The architecture of OmniColor. Our framework categorizes auxiliary signals into spatially-aligned conditions ($C_{spat}$) and semantic-reference conditions ($C_{sem}$). $C_{spat}$ is processed via a dual-encoder strategy to preserve structural details. $C_{sem}$ is processed via a VLM-only encoder, where a Temporal Redundancy Elimination (TRE) module filters redundant tokens (see the sketch after this list). The MMDiT backbone integrates these features through the Adaptive Spatial-Semantic Gating (AS-Gate) module to resolve input conflicts. The model is optimized using the Flow Matching loss ($L_{FM}$) and a Dense Feature Alignment (DFA) loss.
  • Figure 3: Qualitative comparison of prompt-based lineart colorization. We compare our method with Tag2pix [kim2019tag2pix] and state-of-the-art multi-modal generation models. Our framework achieves superior structural fidelity and prompt alignment.
  • Figure 4: Qualitative results of reference-based lineart colorization. Our method maintains robust identity and stylistic consistency even under various motions (rotation, zoom) and significant viewpoint changes.
  • Figure 5: Synergistic control with multiple modalities. We show the effects of combining lineart (L) and text (T) with ID references (I) and color hints (C). The results demonstrate that our model effectively integrates multi-modal signals.
  • ...and 1 more figure
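The token filtering referenced in the Figure 2 caption can be illustrated with a small, hypothetical routine: drop history tokens that are near-duplicates of tokens already kept. The cosine-similarity threshold and greedy selection below are assumptions for illustration, not the paper's exact TRE mechanism.

```python
# Minimal sketch of temporal redundancy elimination over reference tokens
# (hypothetical; shows the general idea of pruning near-duplicate tokens
# from the temporal history before they reach the backbone).
import torch
import torch.nn.functional as F


def filter_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.95) -> torch.Tensor:
    """Keep a token only if its cosine similarity to every already-kept token
    is below `threshold`. tokens: (num_tokens, dim)."""
    normed = F.normalize(tokens, dim=-1)
    kept_idx = []
    for i in range(tokens.shape[0]):
        if not kept_idx:
            kept_idx.append(i)
            continue
        sims = normed[i] @ normed[kept_idx].T  # similarity to all kept tokens
        if sims.max() < threshold:
            kept_idx.append(i)
    return tokens[kept_idx]


if __name__ == "__main__":
    # History tokens from several previous frames, flattened along the token axis.
    history = torch.randn(512, 64)
    compact = filter_redundant_tokens(history)
    print(history.shape[0], "->", compact.shape[0])  # fewer tokens fed to the backbone
```

Pruning of this kind shortens the conditioning sequence roughly in proportion to how repetitive the history frames are, which is consistent with the efficiency motivation stated in the abstract, though the actual TRE criterion may differ.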