Joint Reference Frame Synthesis and Post Filter Enhancement for Versatile Video Coding

Weijie Bao, Yuantong Zhang, Jianghao Jia, Zhenzhong Chen, Shan Liu

TL;DR

This work tackles the efficiency gap in Versatile Video Coding (VVC) by jointly leveraging neural network-based reference frame synthesis (RFS) and post-processing filter enhancement (PFE). It introduces STENet, a dual-pipeline network that takes two input frames and performs both synthesis (to generate virtual reference frames) and enhancement (to reduce artifacts), and it coordinates the two tools within a Space-Time Enhancement Window (STEW) under the VVC Random Access (RA) configuration. A key contribution is the Joint Inference of RFS and PFE (JISE), which runs both in a single pass of STENet to reduce inference overhead, along with a joint training scheme that optimizes both pipelines simultaneously. Experimental results on VTM-15.0 under RA show average PSNR-based BD-rates of -7.34%/-17.21%/-16.65% for the Y, U, and V components, respectively, together with MS-SSIM gains, validating the approach and its compatibility with existing standards, at the cost of increased computation during encoding and decoding.

Abstract

This paper presents joint reference frame synthesis (RFS) and post-processing filter enhancement (PFE) for Versatile Video Coding (VVC), aiming to explore the combination of different neural network-based video coding (NNVC) tools to better utilize the hierarchical bi-directional coding structure of VVC. Both RFS and PFE utilize the Space-Time Enhancement Network (STENet), which receives two input frames with artifacts and produces two enhanced frames with suppressed artifacts, along with an intermediate synthesized frame. STENet comprises two pipelines, the synthesis pipeline and the enhancement pipeline, tailored for different purposes. During RFS, two reconstructed frames are sent into STENet's synthesis pipeline to synthesize a virtual reference frame that resembles the current to-be-coded frame. The synthesized frame serves as an additional reference frame inserted into the reference picture list (RPL). During PFE, two reconstructed frames are fed into STENet's enhancement pipeline, which outputs enhanced frames with reduced artifacts and distortions. To reduce inference complexity, we propose joint inference of RFS and PFE (JISE), achieved through a single execution of STENet. Integrated into the VVC reference software VTM-15.0, RFS, PFE, and JISE are coordinated within a novel Space-Time Enhancement Window (STEW) under the Random Access (RA) configuration. The proposed method achieves -7.34%/-17.21%/-16.65% PSNR-based BD-rate on average for the three components under the RA configuration.
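
For readers unfamiliar with the metric, the sketch below shows one common way a PSNR-based BD-rate (Bjøntegaard delta rate) is computed from four rate-distortion points per codec. It is an illustration under stated assumptions, not the evaluation script used in the paper: the function names are hypothetical, the cubic is built by Lagrange interpolation, and the sample numbers in the usage note are synthetic.

```python
import math

def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial interpolating the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mean_log_rate(psnr, rates, lo, hi):
    """Average of the interpolated log-rate curve over the PSNR range [lo, hi]."""
    log_r = [math.log(r) for r in rates]
    mid = 0.5 * (lo + hi)
    f_lo = lagrange_eval(psnr, log_r, lo)
    f_mid = lagrange_eval(psnr, log_r, mid)
    f_hi = lagrange_eval(psnr, log_r, hi)
    # Simpson's rule is exact for cubics, which is what four points define.
    integral = (hi - lo) / 6.0 * (f_lo + 4.0 * f_mid + f_hi)
    return integral / (hi - lo)

def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Return the BD-rate in percent; negative means the test codec saves bitrate."""
    # This sketch assumes exactly four RD points per codec (assumption).
    assert len(anchor_rates) == len(anchor_psnr) == 4
    assert len(test_rates) == len(test_psnr) == 4
    # Compare the two curves only over the PSNR range they share.
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    avg_anchor = mean_log_rate(anchor_psnr, anchor_rates, lo, hi)
    avg_test = mean_log_rate(test_psnr, test_rates, lo, hi)
    return (math.exp(avg_test - avg_anchor) - 1.0) * 100.0
```

As a sanity check on synthetic data, a test codec that needs 10% less bitrate than the anchor at every PSNR point, for example `bd_rate([1000, 2000, 4000, 8000], [32, 35, 38, 41], [900, 1800, 3600, 7200], [32, 35, 38, 41])`, yields approximately -10. A negative value such as the reported -7.34% for the first component therefore indicates a bitrate saving at equal PSNR.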

Paper Structure

This paper contains 29 sections, 17 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: The framework of joint reference frame synthesis (RFS) and post-processing filter enhancement (PFE). During RFS, two reconstructed frames from DPB are input into STENet's synthesis pipeline to synthesize an intermediate frame, treated as the virtual reference frame, and inserted into two RPLs. During PFE, two reconstructed frames are selected from DPB and input into STENet's enhancement pipeline to alleviate their artifacts and distortions, resulting in higher-quality final output frames.
  • Figure 2: The network architecture of the Space-Time Enhancement Network (STENet). STENet takes two compressed frames with artifacts and distortions, fully leveraging the space-time information to generate an intermediate synthesized frame and two enhanced frames with reduced artifacts. STENet comprises two pipelines: the synthesis pipeline (in blue) for RFS and the enhancement pipeline (in green) for PFE.
  • Figure 3: The Space-Time Enhancement Window (STEW) is proposed to manage RFS, PFE, and JISE effectively under RA configuration. Each STEW contains 8 consecutive frames. Within the current STEW, we determine whether to execute RFS, PFE, or JISE based on the POC $p$ of the current frame. Once the last frame is decoded and enhanced, the STEW will slide forward by 8 POC distances.
  • Figure 4: Visual comparisons between the reconstructed frames generated by the anchor VTM-15.0 and our proposed method. The visual comparisons are conducted under two QP values (37 and 42). The top row showcases the $9^{th}$ frame of BasketballDrill, the middle row displays the $9^{th}$ frame of BQMall, and the bottom row features the $9^{th}$ frame of RaceHorses.
  • Figure 5: Four examples of RD curves on the sequences Tango2, BasketballDrive, BQMall, and BasketballPass. All the sequences are encoded under RA configuration. Red curves denote the experimental results with the proposed method, while black curves represent the results on VTM-15.0.
  • ...and 1 more figure
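
The window arithmetic described in the Figure 3 caption (8 consecutive frames per STEW, sliding forward by 8 POC distances) can be sketched as follows. This is a minimal illustration, not the VTM integration: the function names and the `first_poc` parameter (the POC at which the first window starts) are assumptions, and the rule that picks RFS, PFE, or JISE for a given POC is not given in the caption, so it is left out.

```python
STEW_SIZE = 8  # frames per window, as stated in the Figure 3 caption

def stew_bounds(poc, first_poc=0):
    """Return the inclusive (start, end) POC range of the STEW containing `poc`.

    `first_poc` is a hypothetical parameter: the caption does not say
    where the first window begins.
    """
    if poc < first_poc:
        raise ValueError("poc precedes the first window")
    index = (poc - first_poc) // STEW_SIZE
    start = first_poc + index * STEW_SIZE
    return start, start + STEW_SIZE - 1

def next_stew(bounds):
    """Slide a window forward by STEW_SIZE POC distances."""
    start, end = bounds
    return start + STEW_SIZE, end + STEW_SIZE
```

With the default `first_poc=0`, `stew_bounds(9)` gives `(8, 15)`, and `next_stew((8, 15))` gives `(16, 23)`.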