L-C4: Language-Based Video Colorization for Creative and Consistent Color

Zheng Chang; Shuchen Weng; Huan Ouyang; Yu Li; Si Li; Boxin Shi

TL;DR

Language-based video Colorization for Creative and Consistent Colors (L-C4) guides video colorization with user-provided language descriptions. It is built upon a pre-trained cross-modality generative model and leverages that model's comprehensive language understanding and robust color representation abilities.

Abstract

Automatic video colorization is inherently an ill-posed problem because each monochrome frame has multiple optional color candidates. Previous exemplar-based video colorization methods restrict the user's imagination due to the elaborate retrieval process. Alternatively, conditional image colorization methods combined with post-processing algorithms still struggle to maintain temporal consistency. To address these issues, we present Language-based video Colorization for Creative and Consistent Colors (L-C4) to guide the colorization process using user-provided language descriptions. Our model is built upon a pre-trained cross-modality generative model, leveraging its comprehensive language understanding and robust color representation abilities. We introduce the cross-modality pre-fusion module to generate instance-aware text embeddings, enabling the application of creative colors. Additionally, we propose temporally deformable attention to prevent flickering or color shifts, and cross-clip fusion to maintain long-term color consistency. Extensive experimental results demonstrate that L-C4 outperforms relevant methods, achieving semantically accurate colors, unrestricted creative correspondence, and temporally robust consistency.

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 4 tables.

Introduction
Related work
Video colorization
Language-based image colorization
Video diffusion model
Method
Overview
Inter-frame color consistency
Instance-aware text embedding
Long-term consistent inference
Learning and implementation details
Experiment
Comparison with state-of-the-art methods
User study
Ablation study
...and 7 more sections

Figures (13)

Figure 1: Advantages of our language-based video colorization framework compared to relevant colorization methods [tcvc, bistnet, lcad]. First row: Automatic methods cannot specify the color of each garment, whereas the language-based method explicitly establishes this correspondence to meet users' expectations. Second row: The exemplar-based method cannot colorize the camel purple because an appropriate reference is difficult to find, whereas the language-based method lets users apply creative colors freely. Third row: An image colorization method combined with post-processing algorithms struggles to maintain color consistency when the fish swims rapidly, whereas the language-based method is more robust.
Figure 2: The pipeline of L-C4. (a) During the training phase, video frames are projected into the latent space with a VAE encoder, and noise is subsequently added. The monochrome features $y^{\mathrm{lum}}$ extracted by the luminance (lum) encoder are added to the noised latent codes to align the global structure with the monochrome frames. We equip the denoising U-Net with the Temporal Deformable Attention (TDA) block, ensuring robust inter-frame color consistency. We present the Cross-Modality Pre-Fusion (CMPF) module to generate instance-aware text embeddings, enabling the application of creative colors for specified instances. (b) During the inference phase, we introduce the Cross-Clip Fusion (CCF) to maintain long-term color consistency when colorizing long videos. When decoding the predicted latent code $\tilde{z}^0$, multi-scale monochrome features from the luminance encoder are added into the corresponding scales of the VAE decoder through skip connections to preserve local details.
Figure 3: Illustration of TDA's structure and the different receptive fields. Left: We uniformly sample reference points and estimate an offset for each point to obtain the deformed points. We then extract the relevant context with multi-head attention. Right: Previous temporal attention only extracts context at fixed spatial locations across frames, so it struggles to find relevant context when objects move or deform (e.g., the plane's tail wing). Global attention may introduce features from irrelevant regions (e.g., by attending over all features) and incurs excessive computational cost. Our proposed TDA accurately captures relevant context across frames via the estimated deformed points, addressing both limitations.
Figure 4: Visual quality comparison with automatic video colorization methods [fullyauto, vcgan, tcvc].
Figure 5: Visual quality comparison with exemplar-based video colorization methods [deepexemplar, deepremaster, bistnet].
...and 8 more figures
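
The sampling scheme described in the Figure 3 caption (uniform reference points, per-point offsets, attention over the deformed points) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, no learned query/key/value projections, and offsets passed in as plain inputs, whereas the caption states that the model estimates them. All function names below are hypothetical.

```python
import math

def uniform_reference_points(height, width, rows, cols):
    """Place rows x cols reference points at the cell centres of a regular grid."""
    return [((r + 0.5) * height / rows - 0.5, (c + 0.5) * width / cols - 0.5)
            for r in range(rows) for c in range(cols)]

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a feature vector at a fractional (y, x) location.

    feature_map[i][j] is a list of channel values. Coordinates are clamped
    to the border of the map.
    """
    height, width = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * (1 - wx) * feature_map[y0][x0][c]
        + (1 - wy) * wx * feature_map[y0][x1][c]
        + wy * (1 - wx) * feature_map[y1][x0][c]
        + wy * wx * feature_map[y1][x1][c]
        for c in range(channels)
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_temporal_attention(query, frames, offsets, rows=2, cols=2):
    """Attend from one query vector to features sampled at deformed points.

    frames:  list of feature maps, one per frame.
    offsets: offsets[t][k] = (dy, dx) for reference point k in frame t.
             Assumption: given as input here; a real model would predict them.
    Returns the attended feature vector and the attention weights.
    """
    height, width = len(frames[0]), len(frames[0][0])
    refs = uniform_reference_points(height, width, rows, cols)
    sampled = []
    for t, frame in enumerate(frames):
        for k, (ry, rx) in enumerate(refs):
            dy, dx = offsets[t][k]
            # Deformed point = reference point + offset.
            sampled.append(bilinear_sample(frame, ry + dy, rx + dx))
    # Scaled dot-product scores; sampled features act as both keys and values.
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, feat)) / scale for feat in sampled]
    weights = softmax(scores)
    output = [sum(w * feat[c] for w, feat in zip(weights, sampled))
              for c in range(len(query))]
    return output, weights
```

Each query attends to only `rows * cols` sampled locations per frame instead of every spatial position, which is the cost advantage over global attention that the caption points to. The offsets let those locations follow an object that moves or deforms, which fixed-location temporal attention cannot do.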
