ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization

Yixin Yang, Jiangxin Dong, Jinhui Tang, Jinshan Pan

TL;DR

ColorMNet addresses video colorization with three components: a memory-based feature propagation (MFP) module that connects far-apart frames, a large-pretrained visual model guided feature estimation (PVGFE) module that extracts robust per-frame features, and a local attention (LA) module that exploits similarities between adjacent frames. Together they form an end-to-end trainable network that reduces memory usage while preserving long-range temporal information and semantically rich spatial features. Experiments on DAVIS Perazzi_CVPR_2016, Videvo Lai2018videvo, and NVCC2023 show competitive PSNR, SSIM, FID, and LPIPS scores, improved temporal consistency (CDC), and better efficiency than state-of-the-art exemplar-based approaches. The method shows strong color fidelity and robustness on real-world videos, with about 123.6M parameters and memory and speed advantages over frame-stacking and recurrent strategies.
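
To make the idea of memory-based propagation concrete, the following is a minimal NumPy sketch of an attention-style readout from a bounded feature memory. It is an illustration under stated assumptions, not the authors' implementation: the names `memory_readout`, `FeatureMemory`, and `max_frames` are hypothetical, and both the scaled dot-product affinity and the drop-oldest eviction rule are stand-ins for whatever the paper actually uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_readout(query_keys, memory_keys, memory_values):
    """Read values from memory for each query location.

    query_keys:    (M, C) keys of the current frame
    memory_keys:   (N, C) keys stored from earlier frames
    memory_values: (N, D) values (e.g. color features) stored with those keys
    returns:       (M, D) affinity-weighted mix of memory values
    """
    scale = np.sqrt(query_keys.shape[1])
    # Assumption: scaled dot-product affinity between query and memory keys.
    affinity = softmax(query_keys @ memory_keys.T / scale, axis=1)  # (M, N)
    return affinity @ memory_values


class FeatureMemory:
    """Hypothetical bounded memory: keeps features of at most max_frames frames."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        self.keys = []
        self.values = []

    def add(self, keys, values):
        self.keys.append(keys)
        self.values.append(values)
        if len(self.keys) > self.max_frames:
            # Assumption: drop the oldest frame to bound memory usage.
            self.keys.pop(0)
            self.values.pop(0)

    def read(self, query_keys):
        all_keys = np.concatenate(self.keys, axis=0)
        all_values = np.concatenate(self.values, axis=0)
        return memory_readout(query_keys, all_keys, all_values)
```

Because the readout attends over every stored frame at once, a query can draw on a frame far back in the video without passing through intermediate estimates, which is the property the summary contrasts with recurrent propagation.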

Abstract

How to effectively explore spatial-temporal features is important for video colorization. Instead of stacking multiple frames along the temporal dimension or recurrently propagating estimated features that will accumulate errors or cannot explore information from far-apart frames, we develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames and alleviate the influence of inaccurately estimated features. To extract better features from each frame for the above-mentioned feature propagation, we explore the features from large-pretrained visual models to guide the feature estimation of each frame so that the estimated features can model complex scenarios. In addition, we note that adjacent frames usually contain similar contents. To explore this property for better spatial and temporal feature utilization, we develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood. We formulate our memory-based feature propagation module, large-pretrained visual model guided feature estimation module, and local attention module into an end-to-end trainable network (named ColorMNet) and show that it performs favorably against state-of-the-art methods on both the benchmark datasets and real-world scenarios. The source code and pre-trained models will be available at https://github.com/yyang181/colormnet.
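
The local attention idea, aggregating features from an adjacent frame within a small neighborhood, can be sketched as follows. This is a hedged illustration only: the function name `local_attention`, the `radius` parameter, the square window, and the scaled dot-product weighting are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def local_attention(query, prev_keys, prev_values, radius=1):
    """Aggregate adjacent-frame features in a local spatial window.

    query:       (H, W, C) keys of the current frame
    prev_keys:   (H, W, C) keys of the adjacent (previous) frame
    prev_values: (H, W, D) values of the adjacent frame
    radius:      half-size of the square window (hypothetical parameter)
    returns:     (H, W, D) locally aggregated values
    """
    H, W, C = query.shape
    D = prev_values.shape[2]
    out = np.zeros((H, W, D))
    for y in range(H):
        for x in range(W):
            # Clip the window at the image border.
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            k = prev_keys[y0:y1, x0:x1].reshape(-1, C)
            v = prev_values[y0:y1, x0:x1].reshape(-1, D)
            # Assumption: softmax over scaled dot-product similarities.
            logits = k @ query[y, x] / np.sqrt(C)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            out[y, x] = weights @ v
    return out
```

With `radius=0` each pixel attends only to the same location in the adjacent frame, so the output equals `prev_values`. Larger radii let a pixel follow small motions between frames while keeping the cost proportional to the window size rather than to the whole frame.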

Paper Structure

This paper contains 11 sections, 13 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Colorization results on a real-world video and model performance comparisons between our proposed ColorMNet and other methods on the DAVIS Perazzi_CVPR_2016 dataset in terms of PSNR and running time. State-of-the-art methods zhang2019deep, IizukaSIGGRAPHASIA2019 do not generate well-colorized images in (b) and (c). In contrast, by exploring the features from large-pretrained visual models to estimate robust spatial features for each frame, effectively propagating these features along the temporal dimension based on memory mechanisms for far-apart frames, and exploiting the video property that adjacent frames contain similar contents, our method accurately restores the colors on the grass and generates a realistic image in (d). (e) shows that the proposed ColorMNet performs favorably against state-of-the-art methods in terms of accuracy and running time. The size of the test images for measuring the running time is $960 \times 536$ pixels.
  • Figure 2: An overview of the proposed ColorMNet. The core components of our method include: (a) large-pretrained visual model guided feature estimation (PVGFE) module, (b) memory-based feature propagation (MFP) module and (c) local attention (LA).
  • Figure 3: Qualitative comparisons on the clip parkour from the validation set of the DAVIS Perazzi_CVPR_2016 dataset. (a)-(g) are the colorization results by DDColor kang2022ddcolor, TCVC liu2021temporally, VCGAN vcgan, DeepRemaster IizukaSIGGRAPHASIA2019, DeepExemplar zhang2019deep, BiSTNet$^\dagger$ bistnet and ColorMNet (Ours). (h) Ground truth. The evaluated methods do not generate realistic colorful images in (a)-(f). In contrast, our approach generates a well-colorized image in (g).
  • Figure 4: Qualitative colorization comparisons on the real-world video Manhattan (1979). (a) Input frame. (b) Exemplar images obtained by Google Image Search. (c)-(e) are the colorization results by DeepRemaster IizukaSIGGRAPHASIA2019, DeepExemplar zhang2019deep and ColorMNet (Ours), respectively. The methods zhang2019deep, IizukaSIGGRAPHASIA2019 do not colorize the wall of the building, the trees, and the sky well in (c) and (d). Our ColorMNet generates error-free and realistic colors in (e).
  • Figure 5: Effectiveness of PVGFE for video colorization. (a) Input patch. (b)-(e) are the colorization results by ColorMNet$_{\text{w/ ResNet50}}$, ColorMNet$_{\text{w/ DINOv2}}$, ColorMNet$_{\text{w/ Concatenation}}$ and ColorMNet (Ours), respectively. (f) Ground truth. Compared to the baselines, our approach yields a more natural colorized result in (e).
  • ...and 5 more figures