Prompt-based Consistent Video Colorization

Silvia Dani, Tiberio Uricchio, Lorenzo Seidenari

TL;DR

The paper addresses flickering and labor-intensive colorization of grayscale videos by introducing a language-guided, diffusion-based colorization pipeline that leverages automatic object masks (SAM) and textual prompts. Temporal consistency is achieved through RAFT-based optical flow to propagate chrominance, coupled with a targeted warping-correction step that re-colorizes regions where warping fails. The core per-frame colorization uses L-CAD conditioned on grayscale frames, masks, and prompts, with generic prompts providing strong automatic performance and detailed prompts enabling semantic control. Experiments on DAVIS30 and VIDEVO20 demonstrate state-of-the-art PSNR and strong colorfulness and temporal stability metrics, with dynamic prompts further improving results; the approach offers practical, automated, and controllable video colorization without manual color hints.
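The propagate-then-correct idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the flow is assumed to point from the current frame to the previous one (as a backward warp needs), sampling is nearest-neighbor, and the failure test (a luminance mismatch above a hypothetical threshold `tol`, or a source pixel outside the frame) is a stand-in for whatever criterion the paper uses. The `recolorize` callback is a hypothetical placeholder for the per-frame colorizer.

```python
import numpy as np

def warp_with_flow(prev, flow):
    """Backward-warp `prev` (H, W, C) with nearest-neighbor sampling.

    flow: (H, W, 2) displacement (dx, dy) in pixels, assumed to point from
    each current-frame pixel to its location in the previous frame.
    Returns the warped array and a mask of pixels whose source is in-frame.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    return prev[src_y, src_x], inside

def warp_failure_mask(prev_l, cur_l, flow, tol=0.1):
    """Hypothetical failure test: warped luminance disagrees with the
    current luminance by more than `tol`, or the source is out of frame."""
    warped_l, inside = warp_with_flow(prev_l[..., None], flow)
    mismatch = np.abs(warped_l[..., 0] - cur_l) > tol
    return mismatch | ~inside

def propagate(prev_ab, prev_l, cur_l, flow, recolorize):
    """Warp chrominance forward in time, then re-colorize failed regions.

    recolorize(cur_l, mask) -> (H, W, 2) is a placeholder for the
    per-frame colorization model; only masked pixels are taken from it.
    """
    warped_ab, _ = warp_with_flow(prev_ab, flow)
    bad = warp_failure_mask(prev_l, cur_l, flow)
    if bad.any():
        fresh_ab = recolorize(cur_l, bad)
        warped_ab = np.where(bad[..., None], fresh_ab, warped_ab)
    return warped_ab, bad
```

In a real pipeline the flow would come from an optical flow network such as RAFT and bilinear sampling would replace the nearest-neighbor lookup; the structure of warping first and repairing only the flagged pixels stays the same.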

Abstract

Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
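For readers unfamiliar with the reported metrics, the sketch below computes PSNR and a colorfulness score for a single frame. It assumes 8-bit RGB inputs (peak value 255) and assumes that Colorfulness refers to the commonly used Hasler and Süsstrunk measure; the abstract does not spell out either detail, and CDC is not covered here.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

def colorfulness(rgb):
    """Hasler-Suesstrunk colorfulness of an (H, W, 3) RGB image.

    Assumed here to be the metric meant by "Colorfulness".
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                 # red-green opponent channel
    yb = 0.5 * (r + g) - b     # yellow-blue opponent channel
    spread = np.hypot(rg.std(), yb.std())
    offset = np.hypot(rg.mean(), yb.mean())
    return spread + 0.3 * offset
```

For video, both scores are typically averaged over frames. A pure grayscale frame (R = G = B) has colorfulness 0 under this definition, which is why the score is a useful check that a colorization is not washed out.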

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed framework. Our method takes grayscale video frames as input, along with automatically generated object masks from SAM and a textual prompt (generic for automatic mode, detailed for guided mode).
  • Figure 2: Textual prompts generated using ground-truth color information to simulate ideal guidance. Prompt for the cow image: "Cow brown and white in the center. Grass green in the background. Soil dark brown in the foreground. Fence dark brown at the bottom." Prompt for the horsejump-high image: "Horse is brown and is on the left. Rider is wearing white and is on the horse. Fence is white and is in the center. Ground is beige and covers the bottom. Trees are green and are in the background. Sky is blue and is at the top. Clouds are white and are in the sky. Plants are green and are in the foreground."
  • Figure 3: Example of color shifting from the initial frame to the 11th frame.
  • Figure 4: Example of frame colorization with and without changing the prompt every $\Delta t$ and at a change of scene, in DAVIS30 pairs.
  • Figure 5: Example of frame colorization with and without warping correction.