SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer

Tong Zhang; Wenxue Cui; Shaohui Liu; Feng Jiang

TL;DR

This work addresses how CNN and Transformer features should interact in video post-processing (VPP). It introduces SC-HVPPNet, a network that fuses local CNN features and global Swin-Transformer features through a Spatial Attention Fusion Module (SAFM) and a Channel Attention Fusion Module (CAFM) within Hybrid Fusion Blocks. The architecture combines a Local Feature Extraction Module (LFEM) and a Global Feature Extraction Module (GFEM), with fused representations computed as $F_{in}^{i+1} = (W_{lf}^{CS,i} \otimes F_{lf}^{i}) + (W_{gf}^{CS,i} \otimes F_{gf}^{i})$, enabling efficient cross-domain feature interaction. A Charbonnier loss with Y:U:V weighting of $10:1:1$ guides training, and the model achieves substantial bitrate savings under VVC RA (e.g., $5.54\%$ for Y, $14.18\%$ for U, and $14.31\%$ for V) while outperforming multiple state-of-the-art VPP methods. Overall, SC-HVPPNet demonstrates that hybrid spatial-channel attention leverages image priors to improve video restoration quality with a single model across QP settings, which is promising for practical video coding pipelines.
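To make the training objective concrete, here is a minimal NumPy sketch of a Charbonnier loss combined across the Y, U, and V planes with the 10:1:1 weighting mentioned above. Several details are assumptions rather than facts from the paper: the smoothing constant `eps`, the normalization by the sum of the weights, and the function names `charbonnier` and `yuv_charbonnier` are illustrative only.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-6):
    """Mean Charbonnier penalty sqrt(diff^2 + eps^2) over one plane.

    eps is an assumed smoothing constant, not a value taken from the paper.
    """
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

def yuv_charbonnier(pred, target, weights=(10.0, 1.0, 1.0), eps=1e-6):
    """Weighted combination of per-plane Charbonnier losses (Y:U:V = 10:1:1).

    pred and target are dicts with keys "Y", "U", "V"; planes may differ in
    size (e.g. 4:2:0 chroma subsampling). Dividing by the sum of the weights
    is an assumption made here so the result stays on the per-pixel scale.
    """
    total = 0.0
    for weight, plane in zip(weights, ("Y", "U", "V")):
        total += weight * charbonnier(pred[plane], target[plane], eps)
    return total / sum(weights)

# Usage: a toy 4:2:0 frame where only the luma plane is wrong.
target = {"Y": np.zeros((4, 4)), "U": np.zeros((2, 2)), "V": np.zeros((2, 2))}
pred = {"Y": np.ones((4, 4)), "U": np.zeros((2, 2)), "V": np.zeros((2, 2))}
loss = yuv_charbonnier(pred, target)
```

With this weighting, an error of a given magnitude in the luma plane contributes ten times as much to the loss as the same error in either chroma plane.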

Abstract

Convolutional Neural Networks (CNNs) and Transformers have recently attracted much attention for video post-processing (VPP). However, the interaction between CNNs and Transformers in existing VPP methods has not been fully explored, which leads to inefficient communication between the extracted local and global features. In this paper, we explore this interaction for the VPP task and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which cooperatively exploits image priors in both the spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module generates two attention weights to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module dynamically blends the deep representations along the channel dimension. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for the Y, U, and V components under the VTM-11.0-NNVC RA configuration.
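The attention-weighted fusion of local and global features can be illustrated with a short NumPy sketch of the weighted sum $F_{in}^{i+1} = (W_{lf}^{CS,i} \otimes F_{lf}^{i}) + (W_{gf}^{CS,i} \otimes F_{gf}^{i})$. This is not the authors' implementation: it assumes that $\otimes$ denotes element-wise multiplication with broadcasting, and `toy_attention_weights` is a hypothetical, parameter-free stand-in for the learned SAFM and CAFM weight generators, whose internals are not described here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def toy_attention_weights(f_lf, f_gf):
    """Hypothetical stand-in for the learned attention modules.

    For each branch, a spatial map of shape (1, H, W) and a channel vector
    of shape (C, 1, 1) are derived by simple pooling, then multiplied so
    they broadcast to a combined (C, H, W) weight.
    """
    def combined(f):
        spatial = sigmoid(f.mean(axis=0, keepdims=True))       # (1, H, W)
        channel = sigmoid(f.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1)
        return spatial * channel                               # (C, H, W)
    return combined(f_lf), combined(f_gf)

def hybrid_fuse(f_lf, f_gf, w_lf, w_gf):
    """Weighted sum of local and global features: w_lf * f_lf + w_gf * f_gf."""
    return w_lf * f_lf + w_gf * f_gf

# Usage: fuse random local (CNN) and global (Transformer) feature maps.
rng = np.random.default_rng(0)
f_lf = rng.standard_normal((8, 16, 16))  # local features, shape (C, H, W)
f_gf = rng.standard_normal((8, 16, 16))  # global features, shape (C, H, W)
w_lf, w_gf = toy_attention_weights(f_lf, f_gf)
f_next = hybrid_fuse(f_lf, f_gf, w_lf, w_gf)  # input to the next block
```

The point of the sketch is only the data flow: each branch receives its own weight tensor, and the fused output keeps the shape of the inputs so it can feed the next block.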

Paper Structure (10 sections, 5 equations, 4 figures, 5 tables)

Introduction
THE PROPOSED METHOD
Overview of SC-HVPPNet
The Network Architecture
Loss Function
EXPERIMENTS AND RESULTS
Implementation and Training Details
Comparison with Other VPP Methods
Ablation Study
CONCLUSION

Figures (4)

Figure 1: The architecture of the proposed SC-HVPPNet for video post-processing.
Figure 2: The SAFM and CAFM of SC-HVPPNet.
Figure 3: Visual comparison on the 4th frame of RaceHorses and BasketballDrill at QP=32 and QP=42.
Figure 4: The architectures of different fusion schemes: a) sequential, b) parallel, c) hybrid interaction. $\odot$ is the multiplication operator and $\oplus$ is element-wise addition.
