VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization

Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR

VanGogh addresses the challenges of color bleeding and limited controllability in video colorization by proposing a unified diffusion-based framework that integrates multimodal guidance. The method introduces a Dual QFormer to fuse text and color cues, a Depth Guider for spatial-temporal consistency, and an optical-flow loss to mitigate color overflow, complemented by a color-injection strategy and luma-channel replacement to stabilize VAE reconstructions. A two-stage training regime (image-stage then video-stage) and a robust inference pipeline enable flexible conditioning—from text to exemplars to hints—while maintaining temporal coherence. Extensive qualitative and quantitative evaluations, plus user studies, demonstrate improved color fidelity, temporal stability, and responsiveness to user guidance, establishing VanGogh as a strong, interactive baseline for multimodal video colorization.

Abstract

Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color fidelity. Project page: https://becauseimbatman0.github.io/VanGogh.

Paper Structure (24 sections, 3 equations, 15 figures, 7 tables)

This paper contains 24 sections, 3 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Overall pipeline. We omit the depiction of the VAE encoder and decoder for simplicity. Given a source color video $\mathbf{I}_{\rm gt}^{1:N}$, we first randomly select one frame as the exemplar image and feed it into the Color Projector, where the exemplar is divided into three groups of patches and then passed through the ResBlocks and CLIP image encoder to obtain the color features. The color features are sent to the Dual Qformer along with the encoded prompt, and the calculated features are injected into the UNet through cross attention. For hints injection, we leverage superpixel techniques to synthesize the hints mask $\mathbf{M}^{1:N}$ and the canvas $\mathbf{I}_{\rm canvas}$, which are concatenated with Gaussian noise and the grayscale video $\mathbf{I}_{\rm g}^{1:N}$ to serve as the input for the UNet. Additionally, we design a lightweight Depth Guider to enhance spatial-temporal consistency. During inference, we conduct luma channel replacement between the grayscale video and the output video to alleviate flickering artifacts caused by the video VAE.
  • Figure 2: Large motion causes optical flow estimation errors, which in turn lead to color overflow.
  • Figure 3: The reconstruction results of the video VAE exhibit flickering artifacts in high-frequency areas. Replacing the luma channel in $Lab$ color space can significantly improve the visual quality.
  • Figure 4: Comparison results for automatic video colorization. VCGAN and SVCNet produce severely grayish results. L-CAD suffers from flickering artifacts, and color bleeding persists even after post-processing with DVP. ColorMNet relies heavily on the colored exemplar frame and accumulates errors, as seen where the top of the car turns black. In contrast, our model generates temporally coherent, vivid color videos.
  • Figure 5: Comparison for text-based video colorization. L-CAD+DVP exhibits color bleeding and temporal incoherence. In contrast, our method generates vivid and natural results that align with the given prompts.
  • ...and 10 more figures
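
The luma channel replacement mentioned in the abstract and in the Figure 3 caption can be illustrated with a minimal sketch. It assumes frames are nested lists of sRGB pixels with components in [0, 1] and uses the standard sRGB/D65 conversion to CIE Lab. The function names (`srgb_to_lab`, `lab_to_srgb`, `replace_luma`, `replace_luma_video`) are hypothetical and not taken from the paper, whose actual implementation may differ.

```python
import math

# D65 reference white and the standard sRGB <-> XYZ matrices.
WHITE = (0.95047, 1.0, 1.08883)
RGB_TO_XYZ = (
    (0.4124564, 0.3575761, 0.1804375),
    (0.2126729, 0.7151522, 0.0721750),
    (0.0193339, 0.1191920, 0.9503041),
)
XYZ_TO_RGB = (
    (3.2404542, -1.5371385, -0.4985314),
    (-0.9692660, 1.8760108, 0.0415560),
    (0.0556434, -0.2040259, 1.0572252),
)
DELTA = 6.0 / 29.0


def _matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(row[i] * v[i] for i in range(3)) for row in m)


def srgb_to_lab(rgb):
    """Convert one sRGB pixel (components in [0, 1]) to CIE Lab."""
    lin = tuple(
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4 for c in rgb
    )
    xyz = _matvec(RGB_TO_XYZ, lin)
    fx, fy, fz = (
        t ** (1.0 / 3.0) if t > DELTA ** 3 else t / (3 * DELTA ** 2) + 4.0 / 29.0
        for t in (xyz[i] / WHITE[i] for i in range(3))
    )
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def lab_to_srgb(lab):
    """Convert one CIE Lab pixel back to sRGB, clipping to the [0, 1] gamut."""
    L, a, b = lab
    fy = (L + 16) / 116
    f = (fy + a / 500, fy, fy - b / 200)
    xyz = tuple(
        WHITE[i] * (t ** 3 if t > DELTA else 3 * DELTA ** 2 * (t - 4.0 / 29.0))
        for i, t in enumerate(f)
    )
    out = []
    for c in _matvec(XYZ_TO_RGB, xyz):
        c = min(max(c, 0.0), 1.0)  # clip out-of-gamut values
        out.append(12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055)
    return tuple(out)


def replace_luma(gray, rgb):
    """Keep the a/b chroma of the generated pixel, take L from the grayscale input."""
    l_gray = srgb_to_lab((gray, gray, gray))[0]
    _, a, b = srgb_to_lab(rgb)
    return lab_to_srgb((l_gray, a, b))


def replace_luma_video(gray_video, color_video):
    """Apply replace_luma to every pixel of every frame (assumed layout: frames -> rows -> pixels)."""
    return [
        [[replace_luma(g, c) for g, c in zip(g_row, c_row)]
         for g_row, c_row in zip(g_frame, c_frame)]
        for g_frame, c_frame in zip(gray_video, color_video)
    ]
```

Because the lightness channel comes straight from the grayscale input, high-frequency luminance detail is not affected by VAE reconstruction error; only the chroma channels come from the generated output. In this sketch, results that fall outside the sRGB gamut are clipped, so their lightness may deviate slightly from the input.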