Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

Zefan Qu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Cairong Zhao

TL;DR

This work addresses Online-VQE, where real-time enhancement must operate with only past and current frames, by introducing STLVQE, a lightweight framework that shares latent features through a Module-Agnostic Feature Extractor and employs Spatial-Temporal Look-up Tables (ST-LUTs) to efficiently capture spatio-temporal information. The method reorganizes propagation, alignment, and enhancement into a cohesive pipeline, using a Temporal Cache and a deformable alignment strategy to minimize redundant computation, and replacing heavy convolutions with LUT-based queries in the Enhancement Module. Two-stage training with Charbonnier and MSE losses enables LUT-based inference without sacrificing learning performance. Extensive experiments on MFQE 2.0 demonstrate real-time 720p processing with strong speed–quality trade-offs, competitive PSNR/SSIM improvements, and a small memory footprint, highlighting the practicality of LUT-based temporal processing for online video enhancement.
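The two loss functions named above can be sketched as follows. This is a minimal illustration of the standard Charbonnier and MSE definitions, not the authors' training code: the function names, the `eps` value, and the use of plain Python lists are assumptions, and this summary does not say which loss belongs to which training stage.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((p - t)^2 + eps^2); eps is an assumed smoothing constant."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def mse_loss(pred, target):
    """Mean squared error over paired samples."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

For large errors the Charbonnier loss behaves like an absolute error, while near zero it stays smooth, which is why it is often preferred over a plain L1 loss.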

Abstract

Low latency is crucial for online video applications such as video conferencing and cloud gaming, which makes improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and by their reliance on temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. STLVQE introduces a new VQE framework built around a Module-Agnostic Feature Extractor that greatly reduces redundant computation, and it redesigns the propagation, alignment, and enhancement modules of the network. We also propose Spatial-Temporal Look-up Tables (STL), which extract spatial-temporal information from videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that our STLVQE achieves a satisfactory performance-speed trade-off.
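To illustrate the general idea of trading computation for table look-ups, here is a toy sketch. It is not the STL design from the paper: the table layout, the quantization into 16 levels, the function names, and the blending function standing in for a learned network are all hypothetical. The only point it shows is that a function of a current-frame pixel and a previous-frame pixel can be precomputed once and then queried by index at inference time.

```python
def build_lut(fn, levels=16):
    """Precompute fn for every pair of quantized (current, reference) values."""
    step = 256 // levels
    return [[fn(c * step, r * step) for r in range(levels)] for c in range(levels)]


def query_lut(lut, cur, ref):
    """Look up the cached output for 8-bit inputs; no arithmetic beyond indexing."""
    step = 256 // len(lut)
    return lut[cur // step][ref // step]


def toy_fn(cur, ref):
    # Hypothetical stand-in for a learned mapping: blend the current pixel
    # with the co-located pixel from a previous (reference) frame.
    return round(0.75 * cur + 0.25 * ref)


lut = build_lut(toy_fn)

# Only past and current frames are used, matching the online setting.
previous_row = [64, 64, 200]
current_row = [32, 100, 255]
enhanced_row = [query_lut(lut, c, r) for c, r in zip(current_row, previous_row)]
```

The table here has 16 x 16 entries, so memory grows with the number of quantization levels raised to the number of indexed inputs. That growth is the usual constraint on how many pixels a single look-up can cover.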

Paper Structure (15 sections, 11 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Comparison of $\Delta$PSNR/$\Delta$SSIM, average runtime per frame (720p video), and parameter count on the MFQE 2.0 test set at QP=37. Our STLVQE method achieves a favorable trade-off between enhancement performance and inference speed.
  • Figure 2: The framework of STLVQE (inference phase), which consists of a Module-Agnostic Feature Extractor and three main parts: the propagation, alignment, and enhancement modules. In the inference phase, the propagation module selects the reference frames and retrieves the relevant information, which is sent to the alignment module for temporal alignment. The aligned reference frames and the frame being enhanced are then input to the enhancement module to produce the final result.
  • Figure 3: Reference Frame Window and Temporal Cache in the propagation module.
  • Figure 4: Schematic of our Spatial-Temporal Look-up Tables, which consist of two parts: the Temporal-LUT (left) and the Spatial-LUT (right).
  • Figure 5: Qualitative results at QP 37. The STLVQE method successfully enhances the left leg of the athlete, the face of the old man, the railings of the bridge and the eyebrow of the woman in each frame, respectively.