ControlCol: Controllability in Automatic Speaker Video Colorization

Rory Ward, John G. Breslin, Peter Corcoran

TL;DR

ControlCol tackles automatic speaker video colorization by introducing a controllable, temporally consistent framework that merges text-guided colorization with exemplar-based conditioning. The system leverages a text-guided image colorizer (L-CAD) to generate candidate exemplars, ranks them with a face-focused quality metric, and uses the best exemplar to drive a temporally coherent, exemplar-guided video colorizer (DeepRemaster). Empirically, ControlCol achieves state-of-the-art performance on Grid cooke_martin and Lombard Grid across PSNR, SSIM, FID, and FVD, with notable human preference in user studies (about 90% over DeOldify on Grid cooke_martin). The work demonstrates the practical viability of user-guided automatic video colorization for speaker content, while also highlighting the importance of exemplar quality and temporal consistency, and it outlines directions for improving robustness to out-of-domain data.
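
The exemplar-selection step described above (generate several candidates, score each on the face region, keep the best) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_face_score`, `select_best_exemplar`, the face-box format and the scoring rule are all assumptions made for the example. The paper's actual face-focused quality metric is not specified in this summary, so a crude stand-in (mean RGB channel spread inside the face box) is used purely so the ranking loop has something to rank.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = Sequence[Sequence[Pixel]]   # rows of RGB pixels
Box = Tuple[int, int, int, int]     # (top, left, bottom, right), end-exclusive


def toy_face_score(image: Image, face_box: Box) -> float:
    """Hypothetical stand-in score: mean channel spread inside the face box.

    NOT the paper's metric. It only rewards candidates whose face region
    is not gray, which is enough to demonstrate the selection loop.
    """
    top, left, bottom, right = face_box
    spreads = [
        max(px) - min(px)
        for row in image[top:bottom]
        for px in row[left:right]
    ]
    if not spreads:
        raise ValueError("face_box selects no pixels")
    return sum(spreads) / len(spreads)


def select_best_exemplar(
    candidates: Sequence[Image],
    face_box: Box,
    score_fn: Callable[[Image, Box], float] = toy_face_score,
    higher_is_better: bool = True,
) -> Tuple[int, List[float]]:
    """Score every candidate and return (index of the best one, all scores)."""
    if not candidates:
        raise ValueError("need at least one candidate exemplar")
    scores = [score_fn(img, face_box) for img in candidates]
    pick = max if higher_is_better else min
    best_index = pick(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Two 2x2 candidates: one left gray, one given a skin-like tint.
gray = [[(128, 128, 128)] * 2 for _ in range(2)]
tinted = [[(200, 150, 120)] * 2 for _ in range(2)]
best_index, scores = select_best_exemplar([gray, tinted], face_box=(0, 0, 2, 2))
print(best_index, scores)
```

Passing `higher_is_better=False` lets the same loop rank with a lower-is-better score. In a full pipeline the selected candidate would then be handed to the exemplar-guided video colorizer; that step is outside this sketch.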

Abstract

Adding color to black-and-white speaker videos automatically is a highly desirable technique. It is an artistic process that requires interactivity with humans for the best results. Many existing automatic video colorization systems provide little opportunity for the user to guide the colorization process. In this work, we introduce a novel automatic speaker video colorization system which provides controllability to the user while also maintaining high colorization quality relative to state-of-the-art techniques. We name this system ControlCol. ControlCol performs 3.5% better than the previous state-of-the-art DeOldify on the Grid and Lombard Grid datasets when PSNR, SSIM, FID and FVD are used as metrics. This result is also supported by our human evaluation, where in a head-to-head comparison, ControlCol is preferred 90% of the time to DeOldify. Example videos can be seen in the supplementary material.
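
For readers unfamiliar with the metrics named above, PSNR is the simplest of the four to state. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of 8-bit pixel values. The function name `psnr` and the flat-list input format are choices made for this example, and how the paper aggregates the four metrics into the reported 3.5% improvement is not covered here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], colorized: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(colorized) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, colorized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# One pixel value out of four differs by 10 gray levels.
value = psnr([0, 0, 0, 0], [0, 0, 0, 10])
print(round(value, 2))
```

Higher PSNR means the colorized frame is closer to the ground truth; FID and FVD, by contrast, are distances where lower is better.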

Paper Structure (29 sections, 1 equation, 6 figures, 1 table)

Figures (6)

  • Figure 1: System Architecture of our proposed method. ControlCol (Ours) takes a grayscale video and a text caption as input. It produces a temporally consistent colorized video as output. A text-guided image colorizer, exemplar selection module and an exemplar-guided video colorizer are used in the system.
  • Figure 2: Qualitative analysis of the colorization methods (DeOldify antic2019deoldify, DeepRemaster IizukaSIGGRAPHASIA2019, ColTran https://doi.org/10.48550/arxiv.2102.04432, GCP wu2022vivid, VCGAN Zhao_2023, L-CAD chang2023lcad, ControlCol (Ours) with BN (BRISQUE 6272356 and NIQE 6353522 exemplar selection) and ControlCol (Ours)) on the datasets (Grid cooke_martin, and Lombard Grid lombardGrid). The ground truth sequences are also provided for reference.
  • Figure 3: Controllability analysis of L-CAD chang2023lcad and ControlCol (Ours) on three frames taken from the Grid cooke_martin dataset at ten-frame intervals starting from the first frame. The video's grayscale and ground truth versions are also provided for reference. The caption is held constant for a fair comparison: "A white male, with dark hair, wearing a green top in front of a red background".
  • Figure 4: The questions asked in the survey. Before answering the questionnaire, the participants were shown two sets of three videos. They were then asked for their opinion on both sets of videos. This was then recorded on the questionnaire before being tallied and analysed.
  • Figure 5: Results of the survey. The Y axis represents the MOS HUANG2022105006 score for each approach. The left cluster of bars refers to Question 1, which regards the Grid cooke_martin dataset. The right cluster of bars refers to Question 2, which regards the Lombard Grid lombardGrid dataset. Blue bars correspond to votes for DeOldify antic2019deoldify, orange bars correspond to the ground truth videos, and green bars correspond to ControlCol (Ours).
  • ...and 1 more figure