VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising
Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond Chan, Tieyong Zeng
TL;DR
The paper tackles joint video restoration (deblurring, low-light enhancement, and denoising) by introducing VJT, a multi-tier video transformer with a shared encoder, a progressive tri-tier decoder, feature fusion between tiers, and an adaptive loss weighting scheme. It also provides MLBN, a dedicated dataset synthesized from RealBlur and YouTube videos to reflect realistic combined degradations. The approach achieves state-of-the-art performance on MLBN, aided by the adaptive loss balancing and progressive feature refinement across tiers, and shows notable gains over three-stage concatenation baselines. The work contributes a scalable architecture and a realistic dataset for practical joint video restoration, with code and data planned for public release to support further development and benchmarking.
Abstract
Video restoration aims to recover high-quality videos from low-quality observations. It comprises several important sub-tasks, such as video denoising, deblurring, and low-light enhancement, because videos commonly suffer from different types of degradation, such as blur, low light, and noise. Even worse, these degradations can occur simultaneously when videos are captured in extreme environments, which makes removing all of them at once significantly challenging. In this paper, to the best of our knowledge, we are the first to propose an efficient end-to-end video transformer approach for the joint task of video deblurring, low-light enhancement, and denoising. We build a novel multi-tier transformer in which each tier uses a different level of degraded video as its target, so that video features are learned effectively. Moreover, we carefully design a new tier-to-tier feature fusion scheme to learn video features incrementally and accelerate the training process with a suitable adaptive weighting scheme. We also provide a new Multiscene-Lowlight-Blur-Noise (MLBN) dataset, generated according to the characteristics of the joint task from the RealBlur dataset and YouTube videos to simulate realistic scenes as closely as possible. Extensive experiments and comparisons with many previous state-of-the-art methods clearly demonstrate the effectiveness of our approach.
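
To make the idea of per-tier targets combined through an adaptive weighting more concrete, the following is a minimal sketch and not the method used in VJT. It assumes three tiers, each compared against its own target of decreasing degradation. The Charbonnier distance, the proportional weighting rule, and the names charbonnier, adaptive_weights, and multi_tier_loss are all hypothetical choices made for illustration; the abstract does not specify the actual loss or weighting formula.

```python
import math


def charbonnier(pred, target, eps=1e-3):
    """Mean Charbonnier distance between two equally sized flat sequences.

    Assumption: the abstract does not name the per-tier loss; Charbonnier
    is used here only as a common choice in restoration work.
    """
    return sum(
        math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)
    ) / len(pred)


def adaptive_weights(tier_losses):
    """Hypothetical rule: weight each tier by its share of the total loss.

    Tiers that are currently further from their target get more weight.
    The weights are non-negative and sum to 1.
    """
    total = sum(tier_losses)
    if total == 0:
        # All tiers already match their targets: fall back to equal weights.
        return [1.0 / len(tier_losses)] * len(tier_losses)
    return [loss / total for loss in tier_losses]


def multi_tier_loss(tier_outputs, tier_targets):
    """Combine per-tier losses into one scalar using the adaptive weights."""
    losses = [charbonnier(o, t) for o, t in zip(tier_outputs, tier_targets)]
    weights = adaptive_weights(losses)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights


if __name__ == "__main__":
    # Toy "frames" flattened to two pixels each; one output per tier.
    outputs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    # Hypothetical targets: each tier aims at a progressively cleaner video.
    targets = [[0.5, 0.5], [0.6, 0.6], [1.0, 1.0]]
    total, weights = multi_tier_loss(outputs, targets)
    print("total loss:", total)
    print("tier weights:", weights)
```

In a real training loop the weights would normally be treated as constants when back-propagating, so that the weighting rule itself is not optimized; that detail is likewise an assumption of this sketch.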
