VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising

Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond Chan, Tieyong Zeng

TL;DR

The paper tackles joint video restoration (deblurring, low-light enhancement, and denoising) by introducing VJT, a multi-tier video transformer with a shared encoder, a progressive tri-tier decoder, feature fusion between tiers, and an adaptive loss weighting scheme. It also provides MLBN, a dedicated dataset synthesized from the RealBlur dataset and YouTube videos to reflect realistic combined degradations. The approach achieves state-of-the-art performance on MLBN, aided by the adaptive loss balancing and progressive feature refinement across tiers, and shows notable gains over baselines that concatenate three single-task stages. The work advances practical joint video restoration by delivering a scalable architecture and a realistic dataset, with code and data planned for public release to support further development and benchmarking.

Abstract

The video restoration task aims to recover high-quality videos from low-quality observations. It comprises several important sub-tasks, such as video denoising, deblurring, and low-light enhancement, because video often suffers from different types of degradation, such as blur, low light, and noise. Even worse, these kinds of degradation can occur simultaneously when videos are captured in extreme environments, which poses significant challenges for removing all of these artifacts at the same time. In this paper, to the best of our knowledge, we are the first to propose an efficient end-to-end video transformer approach for the joint task of video deblurring, low-light enhancement, and denoising. This work builds a novel multi-tier transformer in which each tier uses a different level of degraded video as a target to learn video features effectively. Moreover, we carefully design a new tier-to-tier feature fusion scheme to learn video features incrementally and accelerate the training process with a suitable adaptive weighting scheme. We also provide a new Multiscene-Lowlight-Blur-Noise (MLBN) dataset, which is generated according to the characteristics of the joint task, based on the RealBlur dataset and YouTube videos, to simulate realistic scenes as closely as possible. We have conducted extensive experiments, comparing against many previous state-of-the-art methods, to clearly show the effectiveness of our approach.
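To make the idea of per-tier targets with adaptive loss weighting concrete, here is a minimal sketch in plain Python. The abstract does not specify the loss function or the weighting rule, so everything below is an assumption for illustration only: the Charbonnier distance, the softmax-over-loss-ratios rule, and the names `charbonnier`, `adaptive_weights`, and `multi_tier_loss` are hypothetical and not taken from the paper.

```python
import math


def charbonnier(pred, target, eps=1e-3):
    """Mean Charbonnier distance between two equally sized flat sequences.

    Assumption: the paper's actual per-tier loss is not given in the abstract.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def adaptive_weights(current, previous, temperature=1.0):
    """Hypothetical weighting rule: softmax over per-tier loss ratios.

    A tier whose loss shrank less since the previous step gets a larger
    weight. The returned weights are positive and sum to 1.
    """
    ratios = [c / max(p, 1e-12) for c, p in zip(current, previous)]
    peak = max(ratios)  # subtract the maximum for numerical stability
    exps = [math.exp((r - peak) / temperature) for r in ratios]
    norm = sum(exps)
    return [e / norm for e in exps]


def multi_tier_loss(tier_outputs, tier_targets, previous_losses):
    """Combine one loss per tier into a single weighted training loss.

    Each tier is compared against its own target (a different level of
    degraded video, as the abstract describes); which degradation each
    tier removes is left to the caller here.
    """
    losses = [charbonnier(o, t) for o, t in zip(tier_outputs, tier_targets)]
    weights = adaptive_weights(losses, previous_losses)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, losses, weights
```

In a training loop, `previous_losses` would hold the per-tier losses from the preceding step or epoch, so the weights shift toward whichever tier is currently improving most slowly.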

Paper Structure (28 sections, 10 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Restoration results on a part of a frame from a daytime outdoor scene of our dataset. Our method simultaneously removes blur, enhances low light and eliminates noise. Compared to three recent methods retrained on our MLBN dataset, our method improves image quality in terms of object detail.
  • Figure 2: An illustration of the proposed VJT. It contains a single-tier Encoder and a Multi-tier Decoder. Both the encoder and the decoder contain multi-scale Attention-Warping modules. A shallow feature extraction module sits at the start, while four attention modules and a reconstruction module sit at the end. Each tier outputs a video with a different level of restoration. Feature fusion modules between tiers transfer features for the progressive joint tasks. The skip connections have been omitted for clarity.
  • Figure 3: Illustrations for Feature Fusion between tiers and Attention Module in VJT. (a) shows the feature fusion between tier1 and tier2 at the first Attention-Warping module of the Decoder. (b) illustrates the attention module, which consists of $N$ sub-parts containing W-MMA and $M$ sub-parts with only W-MSA.
  • Figure 4: A concise diagram of our data synthesis process.
  • Figure 5: Qualitative visual comparisons on a frame of a night-time outdoor street scene in our MLBN dataset. Our approach demonstrates enhanced capabilities in restoring nighttime illumination, deblurring, and denoising. We mask it to show it more visually.
  • ...and 3 more figures