Standard compliant video coding using low complexity, switchable neural wrappers

Yueyu Hu; Chenhao Zhang; Onur G. Guleryuz; Debargha Mukherjee; Yao Wang

TL;DR

The paper addresses the deployment bottleneck of neural video codecs by proposing a standard-compatible framework that wraps a conventional codec with switchable neural pre- and post-processors. It introduces a low-complexity neural post-processor (516 MACs/pixel) and jointly optimizes it with a neural pre-processor, using a differentiable codec proxy to enforce rate constraints; the rate-distortion optimal downsampling ratio $r$ is signaled per sequence. Results on UVG and AOM CTC show BD-rates of up to $-22.6\%$ relative to HEVC and $-9.3\%$ relative to VVC, with decoding times suitable for consumer hardware (e.g., $7.7$ ms per 1080p frame on a mid-range GPU). The approach adds little decoding complexity, suggesting a viable path for neural tools in next-generation standards.

Abstract

The proliferation of high-resolution videos puts great storage and bandwidth pressure on cloud video services, driving the development of next-generation video codecs. Despite great progress in neural video coding, existing approaches are still far from economical deployment given their tradeoff between complexity and rate-distortion performance. To clear the roadblocks for neural video coding, in this paper we propose a new framework featuring standard compatibility, high performance, and low decoding complexity. We employ a set of jointly optimized neural pre- and post-processors, wrapping a standard video codec, to encode videos at different resolutions. The rate-distortion optimal downsampling ratio is signaled to the decoder at the per-sequence level for each target rate. We design a low-complexity neural post-processor architecture that can handle different upsampling ratios. The change of resolution exploits the spatial redundancy in high-resolution videos, while the neural wrapper further improves rate-distortion performance through end-to-end optimization with a codec proxy. Our lightweight post-processor architecture has a complexity of 516 MACs/pixel, and achieves a 9.3% BD-Rate reduction over VVC on the UVG dataset and 6.4% on AOM CTC Class A1. Our approach has the potential to further advance the performance of the latest video coding standards using neural processing with minimal added complexity.
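To put the 516 MACs/pixel figure in per-frame terms, the short calculation below multiplies it by the pixel count of a 1080p frame. It is only a back-of-the-envelope sketch: it assumes the cost is counted per output pixel at full 1920x1080 resolution, which the abstract does not state explicitly, and the function name `macs_per_frame` is introduced here for illustration.

```python
# Back-of-the-envelope cost of the post-processor per frame.
# Assumption: 516 MACs are spent per output pixel at full resolution.
MACS_PER_PIXEL = 516          # figure quoted in the abstract
WIDTH, HEIGHT = 1920, 1080    # assumed 1080p output frame


def macs_per_frame(macs_per_pixel: int, width: int, height: int) -> int:
    """Total multiply-accumulate operations for one frame."""
    return macs_per_pixel * width * height


total = macs_per_frame(MACS_PER_PIXEL, WIDTH, HEIGHT)
print(f"{total / 1e9:.2f} GMACs per 1080p frame")  # prints "1.07 GMACs per 1080p frame"
```

Under this assumption the post-processor costs roughly one billion MACs per 1080p frame.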

Paper Structure (14 sections, 2 equations, 6 figures, 2 tables)

Introduction
Related Works
Neural Processing for Video Coding
End-to-End Learned Video Compression
Proposed Method
Coding Framework
Efficient Post-Processor
Jointly Optimize Neural Wrapper with Codec Proxy
Training
Experiments
Settings
Effectiveness of Neural Wrappers
Qualitative Analysis
Conclusion

Figures (6)

Figure 1: Overall encoding and decoding process with the proposed scheme: A standard codec is wrapped by a neural wrapper with switchable weights.
Figure 2: Extended JPEG Proxy for end-to-end training: The input $X$ is the 3-channel frame produced by the preprocessor.
Figure 3: Structure of efficient post-processor. Layers are labeled with format [kernel size], (input / output channels) or (input $\rightarrow$ output channel) if there are different channel numbers for the input and output.
Figure 4: Illustration of Pareto Frontier by combining R-D curves resulting from all downsampling ratios on two typical sequences in the UVG dataset (only postprocessor is used).
Figure 5: Rate-distortion curves averaged over all videos in the UVG dataset. We achieve -22.6% BD-Rate over HEVC and -9.3% over VVC.
...and 1 more figure
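The Pareto frontier described in the Figure 4 caption, and the per-sequence choice of downsampling ratio for a target rate, can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function names, the discrete selection rule (no interpolation between operating points), and all rate/PSNR values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

RDPoint = Tuple[float, float]            # (rate in kbps, PSNR in dB)
Labeled = Tuple[float, float, float]     # (rate, PSNR, downsampling ratio)


def pareto_frontier(curves: Dict[float, List[RDPoint]]) -> List[Labeled]:
    """Merge the R-D curves of all ratios and keep only non-dominated points."""
    points = [(rate, psnr, ratio)
              for ratio, pts in curves.items()
              for rate, psnr in pts]
    # Ascending rate; on equal rate, the higher PSNR comes first.
    points.sort(key=lambda p: (p[0], -p[1]))
    frontier: List[Labeled] = []
    best_psnr = float("-inf")
    for rate, psnr, ratio in points:
        if psnr > best_psnr:             # strictly better than every cheaper point
            frontier.append((rate, psnr, ratio))
            best_psnr = psnr
    return frontier


def ratio_for_target_rate(frontier: List[Labeled],
                          target_rate: float) -> Optional[float]:
    """Ratio of the best-quality frontier point whose rate fits the target."""
    feasible = [p for p in frontier if p[0] <= target_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[1])[2]


# Hypothetical R-D curves for one sequence, keyed by downsampling ratio.
curves = {
    1.0: [(1000.0, 34.0), (2000.0, 37.0), (4000.0, 40.0)],
    2.0: [(500.0, 33.0), (1000.0, 35.0), (2000.0, 36.5)],
}
frontier = pareto_frontier(curves)
low = ratio_for_target_rate(frontier, 1500.0)
high = ratio_for_target_rate(frontier, 3000.0)
```

With these made-up numbers, the downsampled encode wins at low rates and the full-resolution encode wins at high rates, which is the kind of crossover the combined frontier is meant to capture.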
