LatentColorization: Latent Diffusion-Based Speaker Video Colorization
Rory Ward, Dan Bigioi, Shubhajit Basak, John G. Breslin, Peter Corcoran
TL;DR
This work tackles automatic video colorization with strict temporal coherence by introducing LatentColorization, a latent-diffusion-based system conditioned on exemplar frames and previous frames. By combining a VQ-VAE, a conditioned latent diffusion process, and an autoregressive framework, the method achieves strong temporal consistency and high fidelity across speaker-video datasets, outperforming several state-of-the-art methods on standard metrics and in a subjective user study. Key technical contributions include explicit temporal conditioning, an end-to-end system design, and comprehensive evaluations on the GRID, Lombard Grid, and Sherlock Holmes datasets, with an FVD improvement of roughly 18% and competitive no-reference quality measures. The results suggest practical potential for automated, high-quality colorization of archival video content, while the authors acknowledge the domain sensitivity and slow inference inherent to diffusion-based approaches.
Abstract
While current research predominantly focuses on image-based colorization, the domain of video-based colorization remains relatively unexplored. Most existing video colorization techniques operate on a frame-by-frame basis, often overlooking the critical aspect of temporal coherence between successive frames. This can produce inconsistencies across frames, leading to undesirable effects such as flickering or abrupt color transitions. To address these challenges, we harness the generative capabilities of a fine-tuned latent diffusion model designed specifically for video colorization, introducing a novel solution for achieving temporal consistency and demonstrating strong improvements on established image quality metrics compared to existing methods. Furthermore, we perform a subjective study in which users preferred our approach to the existing state of the art. Our dataset combines conventional datasets with videos from television and movies. In short, a fine-tuned latent diffusion-based colorization system with a temporal consistency mechanism improves automatic video colorization by addressing temporal inconsistency. Example videos demonstrating our results are available at https://youtu.be/vDbzsZdFuxM.
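
To make the idea of temporal conditioning concrete, the following is a minimal sketch of an autoregressive colorization loop in which each frame is colorized given an exemplar frame and the previously colorized frame. It is not the authors' implementation. The names `colorize_video` and `toy_colorize_frame` are hypothetical, frames are represented as flat lists of floats rather than image tensors, the toy blend function merely stands in for the conditioned latent diffusion model, and seeding the first step with the exemplar as the "previous" frame is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # flattened frame; stands in for an image tensor

# Hypothetical signature: (grayscale frame, exemplar, previous output) -> colorized frame
ColorizeFn = Callable[[Frame, Frame, Frame], Frame]


def colorize_video(
    gray_frames: Sequence[Frame],
    exemplar: Frame,
    colorize_frame: ColorizeFn,
) -> List[Frame]:
    """Colorize frames one at a time, feeding each output back as conditioning."""
    outputs: List[Frame] = []
    # Assumption: the exemplar doubles as the "previous frame" for the first step.
    previous = exemplar
    for gray in gray_frames:
        colored = colorize_frame(gray, exemplar, previous)
        outputs.append(colored)
        previous = colored  # the next frame is conditioned on this result
    return outputs


def toy_colorize_frame(gray: Frame, exemplar: Frame, previous: Frame) -> Frame:
    """Placeholder for a conditioned generative model: a fixed weighted blend."""
    return [0.5 * g + 0.25 * e + 0.25 * p for g, e, p in zip(gray, exemplar, previous)]


if __name__ == "__main__":
    frames = [[1.0, 1.0], [0.0, 0.0]]
    exemplar = [0.0, 4.0]
    for frame in colorize_video(frames, exemplar, toy_colorize_frame):
        print(frame)
```

Because every output depends on the one before it, a frame-by-frame model has no such link, which is where flicker between frames can arise. The trade-off of the autoregressive design is that frames must be processed sequentially.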
