LatentColorization: Latent Diffusion-Based Speaker Video Colorization

Rory Ward; Dan Bigioi; Shubhajit Basak; John G. Breslin; Peter Corcoran

TL;DR

This work tackles automatic video colorization with strict temporal coherence by introducing LatentColorization, a latent-diffusion-based system conditioned on exemplar frames and previous frames. By leveraging a VQ-VAE autoencoder, a conditioned latent diffusion process, and an autoregressive framework, the method achieves strong temporal consistency and high fidelity across speaker-video datasets, outperforming several state-of-the-art methods on standard metrics and in a subjective user study. Key technical contributions include explicit temporal conditioning, end-to-end system design, and comprehensive evaluations across GRID, Lombard Grid, and Sherlock Holmes datasets, with notable FVD improvements (~18%) and competitive non-reference quality measures. The results suggest practical potential for automated, high-quality colorization of archival video content, while also acknowledging domain sensitivity and speed limitations inherent to diffusion-based approaches.

Abstract

While current research predominantly focuses on image-based colorization, video-based colorization remains relatively unexplored. Most existing video colorization techniques operate frame by frame and often overlook temporal coherence between successive frames. This can produce inconsistencies such as flickering or abrupt color transitions. To address these challenges, we harness the generative capabilities of a fine-tuned latent diffusion model designed specifically for video colorization, introducing a novel solution for achieving temporal consistency in video colorization and demonstrating strong improvements on established image quality metrics compared to existing methods. Furthermore, we perform a subjective study in which users preferred our approach to the existing state of the art. Our dataset combines conventional datasets with videos from television and movies. In short, by pairing a fine-tuned latent diffusion-based colorization system with a temporal consistency mechanism, we improve the performance of automatic video colorization by addressing temporal inconsistency. Example videos demonstrating our results are available at https://youtu.be/vDbzsZdFuxM.

Paper Structure (28 sections, 4 equations, 12 figures, 2 tables)

Introduction
Traditional Colorization
Automatic Colorization
Research Contribution
Related Work
Conventional Deep Learning Approaches
Diffusion Models
Methodology
Design Considerations
Data Processing
System Overview
Image Diffusion Based Set Up
Latent Diffusion Based Set Up
Temporal Consistency
Hyperparameter and Training Set Up
...and 13 more sections

Figures (12)

Figure 1: "Sherlock Holmes and the Woman in Green" (1945) black-and-white frames.
Figure 2: "Sherlock Holmes and the Woman in Green" (1945) LatentColorization output frames.
Figure 3: Diagram of the Diffusion Process: This diagram illustrates the operation of the diffusion model in both the forward and backward processes. In the forward process, it visually portrays the incremental addition of Gaussian noise to the input image $x_0$ until it becomes visually indistinguishable from Gaussian noise $x_T$ (top). Subsequently, it showcases the learned backward diffusion process, where the model gradually removes the Gaussian noise from $x_T$ to return to the original image $x_0$ (bottom).
Figure 4: Comparison of 3 consecutive frames with different operations applied: First Row (Ground Truth): This row showcases the original, unaltered images, representing the ground truth reference. Second Row (Diffusion Model): The second row shows the colorization output generated by our original diffusion model. Third Row (Diffusion Model with Post-Processing): Here, the output of the diffusion model is presented with an additional post-processing procedure applied to enhance the results. Fourth Row (LatentColorization): The final row displays the results obtained from LatentColorization.
Figure 5: The system architecture during training is depicted in the diagram, illustrating the key elements of the network and their interactions: Image Encoder: This component is responsible for encoding the input frames into embedding representations. It generates the ground truth embedding $Z_{GT}$, the embedding of the current black-and-white frame $Z_{BW}$, and the embedding of the previous color frame $Z_P$. Denoising Unet: This is a critical part of the architecture, responsible for denoising and refining the embeddings generated by the Image Encoder that have passed through the forward diffusion process. Conditioning Mechanism: The conditioning mechanism is integral to the network, providing contextual information and conditioning signals to guide the colorization process. It takes into account the embeddings $Z_{BW}$, $Z_P$, and $Z_{T}$, which represent, respectively, the black-and-white input frame, the output of the model at the previous timestep, and the noisy frame to be denoised. Image Decoder: This component is responsible for decoding the predicted frames from their embedding representations. The architecture's design and interactions are essential for the model's training process, ensuring that it learns to generate accurate and temporally consistent colorizations over multiple timesteps.
...and 7 more figures
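
The forward process described in the Figure 3 caption (Gaussian noise added step by step until $x_0$ becomes indistinguishable from noise $x_T$) can be illustrated with a minimal sketch. The linear noise schedule, the step count of 1000, the closed-form sampling of $x_t$ from $x_0$, and the names `make_alpha_bar` and `forward_diffuse` are assumptions taken from the standard DDPM formulation, not details confirmed by this page; the random array stands in for an image or latent embedding.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot (standard DDPM closed form, assumed here)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical stand-in for a frame or its latent embedding, scaled to [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(4, 32, 32))

x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)   # still close to x0
x_late, _ = forward_diffuse(x0, 999, alpha_bar, rng)   # close to pure noise

corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
print(f"correlation with x0 at t=10:  {corr_early:.3f}")
print(f"correlation with x0 at t=999: {corr_late:.3f}")
```

The learned backward process would train a network to predict the returned `noise` from `x_t`; in a conditioned setup like the one in Figure 5, that network would additionally receive the conditioning embeddings as input.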
