ControlCol: Controllability in Automatic Speaker Video Colorization
Rory Ward, John G. Breslin, Peter Corcoran
TL;DR
ControlCol is a controllable, temporally consistent framework for automatic speaker video colorization that combines text-guided colorization with exemplar-based conditioning. A text-guided image colorizer (L-CAD) generates candidate exemplars, a face-focused quality metric ranks them, and the top-ranked exemplar drives a temporally coherent, exemplar-guided video colorizer (DeepRemaster). ControlCol achieves state-of-the-art performance on Grid cooke_martin and Lombard Grid across PSNR, SSIM, FID, and FVD, and in user studies it is preferred over DeOldify about 90% of the time on Grid cooke_martin. The work demonstrates the practical viability of user-guided automatic video colorization for speaker content, highlights the importance of exemplar quality and temporal consistency, and outlines directions for improving robustness to out-of-domain data.
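The exemplar-selection step described above (generate candidates, score each on the face region, keep the best) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `face_score` and `select_exemplar`, the `(top, left, bottom, right)` box format, and in particular the scoring rule, which is a simple stand-in (mean RGB channel spread inside the face box) and not the face-focused quality metric used by ControlCol. Candidate images are assumed to have been produced elsewhere, for example by a text-guided colorizer such as L-CAD.

```python
from typing import Callable, Sequence, Tuple

import numpy as np

# Hypothetical face box format: (top, left, bottom, right) in pixel indices.
Box = Tuple[int, int, int, int]


def face_score(image: np.ndarray, face_box: Box) -> float:
    """Stand-in quality score, NOT the paper's metric.

    Returns the mean per-pixel spread between the largest and smallest
    RGB channel inside the face box, so a gray face scores 0.
    """
    top, left, bottom, right = face_box
    face = image[top:bottom, left:right].astype(np.float64)
    return float((face.max(axis=-1) - face.min(axis=-1)).mean())


def select_exemplar(
    candidates: Sequence[np.ndarray],
    face_box: Box,
    score_fn: Callable[[np.ndarray, Box], float] = face_score,
) -> Tuple[int, np.ndarray]:
    """Score every candidate exemplar and return (index, image) of the best."""
    if len(candidates) == 0:
        raise ValueError("at least one candidate exemplar is required")
    scores = [score_fn(candidate, face_box) for candidate in candidates]
    best = int(np.argmax(scores))
    return best, candidates[best]
```

In a full pipeline, the selected image would then be passed as the reference frame to an exemplar-guided video colorizer such as DeepRemaster. Because the scoring function is a parameter, a different face-quality metric can be swapped in without changing the selection logic.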
Abstract
Adding color to black-and-white speaker videos automatically is a highly desirable technique. Colorization is an artistic process that requires human interaction for the best results, yet many existing automatic video colorization systems give the user little opportunity to guide the process. In this work, we introduce a novel automatic speaker video colorization system that provides controllability to the user while maintaining high colorization quality relative to state-of-the-art techniques. We name this system ControlCol. ControlCol performs 3.5% better than the previous state-of-the-art DeOldify on the Grid and Lombard Grid datasets, as measured by PSNR, SSIM, FID, and FVD. This result is supported by our human evaluation, in which ControlCol is preferred over DeOldify 90% of the time in a head-to-head comparison. Example videos can be seen in the supplementary material.
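Of the four metrics named above, PSNR has the simplest closed form, 10 * log10(MAX^2 / MSE), and a short sketch shows what it measures. The function name `psnr` and the default peak value of 255 (8-bit images) are assumptions for illustration; the abstract does not state how frames are aggregated or which color space is used in the paper's evaluation.

```python
import numpy as np


def psnr(reference, estimate, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two same-shaped images.

    max_value is the largest possible pixel value (255 assumes 8-bit data).
    """
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    if ref.shape != est.shape:
        raise ValueError("reference and estimate must have the same shape")
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher PSNR means the colorized frame is closer to the ground-truth color frame pixel by pixel. It does not capture perceptual or temporal quality, which is presumably why it is reported alongside SSIM, FID, and FVD.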
