Prompt-based Consistent Video Colorization
Silvia Dani, Tiberio Uricchio, Lorenzo Seidenari
TL;DR
The paper tackles two problems in grayscale video colorization, temporal flickering and the heavy manual effort existing methods require, with a language-guided, diffusion-based pipeline that uses automatically generated object masks (SAM) and textual prompts. Temporal consistency is achieved by propagating chrominance between frames with RAFT optical flow, followed by a targeted correction step that re-colorizes regions where warping fails. Per-frame colorization is performed by L-CAD, conditioned on the grayscale frame, the masks, and the prompt; a generic prompt gives strong fully automatic performance, while detailed prompts allow semantic control. Experiments on DAVIS30 and VIDEVO20 report state-of-the-art PSNR and strong colorfulness and temporal stability scores, with dynamic prompts improving results further. The approach thus offers automated, controllable video colorization without manual color hints.
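The flow-based propagation and correction step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the nearest-neighbour sampling, the backward-flow convention, and the photometric threshold `tol` are all assumptions made for the example, and the summary does not specify how the paper actually detects warping failures. In the real pipeline the flow would come from RAFT, and flagged pixels would be handed back to the diffusion colorizer.

```python
import numpy as np

def warp_backward(prev, flow):
    """Nearest-neighbour backward warp (hypothetical helper).

    prev: (H, W, C) array from frame t-1.
    flow: (H, W, 2) backward flow (dx, dy); the pixel at (x, y) in frame t
          is assumed to originate from (x + dx, y + dy) in frame t-1.
    Returns the warped array and a boolean mask of in-bounds samples.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Clip so indexing is safe; out-of-bounds samples are flagged via `valid`.
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    return prev[src_y, src_x], valid

def propagate_chroma(l_prev, ab_prev, l_cur, flow, tol=0.1):
    """Warp chrominance from frame t-1 and flag pixels needing re-colorization.

    Assumed failure test: the sample fell outside the previous frame, or the
    warped previous luminance disagrees with the current luminance by > tol.
    """
    warped_ab, valid = warp_backward(ab_prev, flow)
    warped_l, _ = warp_backward(l_prev[..., None], flow)
    photometric_err = np.abs(warped_l[..., 0] - l_cur)
    needs_fix = ~valid | (photometric_err > tol)
    return warped_ab, needs_fix
```

Under these assumptions, `needs_fix` marks disoccluded or mismatched regions; only those pixels would be re-colorized, while the rest keep the warped chrominance and therefore stay consistent with the previous frame.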
Abstract
Existing video colorization methods either suffer from temporal flickering or demand extensive manual input. We propose an approach that automates high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt and achieves state-of-the-art results without any specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step then detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show that our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
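For readers unfamiliar with the frame-level metrics named above, the sketch below shows how PSNR and a colorfulness score are commonly computed. It assumes that "Colorfulness" refers to the widely used Hasler and Süsstrunk measure and that images are RGB arrays on a 0 to 255 scale; neither detail is confirmed by the abstract, and the function names are illustrative. CDC is omitted because it depends on multi-frame color distributions.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    diff = reference.astype(float) - estimate.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def colorfulness(rgb):
    """Hasler-Süsstrunk colorfulness of an (H, W, 3) RGB image.

    Uses the opponent channels rg = R - G and yb = 0.5 * (R + G) - B,
    combining their standard deviations and means.
    """
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    rg = r - g
    yb = 0.5 * (r + g) - b
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return float(std + 0.3 * mean)
```

Higher PSNR indicates a colorization closer to the ground-truth frame, while colorfulness is reference-free: a purely gray image scores zero under this definition.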
