Improving the Efficiency of VVC using Partitioning of Reference Frames
Kamran Qureshi, Hadi Amirpour, Christian Timmerer
TL;DR
This work addresses the encoding-time burden of Versatile Video Coding (VVC) by introducing Early Termination using Reference Frames (ETRF), an intermediate VVenC preset that uses the partitioning patterns of reference frames in lower temporal layers to prune the rate-distortion (RD) search space. ETRF builds an $8 \times 8$ reference-CU map and derives a dynamic max_sz per CTU, with a depth margin that allows up to two depths below max_sz. It is applied only to temporal layers $2$ through $5$, to maintain bitrate targets while reducing computation. Evaluated on the JVET CTC and Inter4K datasets at four QPs, ETRF reduces encoding time by around $21\%$ compared to the medium preset and improves overall efficiency relative to the fast preset (e.g., $0.18$ vs $0.31$ in the $\frac{BDBR}{BDT}$ metric), particularly for high-complexity content, making it a practical mid-range option for real-world VVC deployments.
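The pruning idea summarized above can be illustrated with a minimal Python sketch. It is not the VVenC implementation. The function names, the CTU size of 128, the use of max(w, h) as a CU's "size", and the reading of "depth" as successive halving of the block size are all assumptions made for illustration. Only the $8 \times 8$ map granularity, the two-depth margin, and the temporal-layer range come from the text.

```python
CELL = 8                   # granularity of the reference-CU map (from the text)
CTU = 128                  # assumed CTU size; not stated in the text
DEPTH_MARGIN = 2           # "up to two depths below max_sz" (from the text)
ETRF_LAYERS = range(2, 6)  # temporal layers 2 through 5 (from the text)


def build_ref_map(cus, width, height):
    """Record, for every 8x8 cell, the size of the reference CU covering it.

    `cus` is a hypothetical list of (x, y, w, h) rectangles in pixels.
    Using max(w, h) as the CU size is an assumption.
    """
    cols, rows = width // CELL, height // CELL
    ref_map = [[0] * cols for _ in range(rows)]
    for x, y, w, h in cus:
        for r in range(y // CELL, (y + h) // CELL):
            for c in range(x // CELL, (x + w) // CELL):
                ref_map[r][c] = max(w, h)
    return ref_map


def ctu_max_sz(ref_map, ctu_x, ctu_y):
    """Largest reference CU size found inside the co-located CTU area."""
    cells = CTU // CELL
    r0, c0 = ctu_y // CELL, ctu_x // CELL
    return max(v for row in ref_map[r0:r0 + cells] for v in row[c0:c0 + cells])


def may_split(cu_size, max_sz, temporal_layer):
    """Return False when a further split should be skipped (early termination).

    Assumed rule: a split is allowed only if the resulting child block is
    at most DEPTH_MARGIN halvings smaller than max_sz.
    """
    if temporal_layer not in ETRF_LAYERS:
        return True  # frames outside layers 2..5 keep the full RD search
    child = cu_size // 2
    return child * (1 << DEPTH_MARGIN) >= max_sz
```

Under these assumptions, a CTU whose co-located reference area contains a $64 \times 64$ CU would still test blocks down to $16 \times 16$ in layers 2 through 5, but would not split a $16 \times 16$ block further. A real VVC encoder also has binary and ternary splits, which this sketch ignores.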
Abstract
In response to the growing demand for high-quality videos, Versatile Video Coding (VVC) was released in 2020. It builds on the hybrid coding architecture of its predecessor, HEVC, and achieves about 50% bitrate reduction for the same visual quality. It introduces more flexible block partitioning, which enhances compression efficiency at the cost of increased encoding complexity. To make efficient use of VVC in practical applications, optimization is essential. VVenC, an optimized open-source VVC encoder, introduces multiple presets to address the trade-off between compression efficiency and encoder complexity. Although an optimized set of encoding tools has been selected for each preset, the rate-distortion (RD) search space in the encoder presets still poses a challenge for efficient encoder implementations. In this paper, we propose Early Termination using Reference Frames (ETRF), which improves the trade-off between encoding efficiency and time complexity and positions itself as a new preset between the medium and fast presets. The CTU partitioning map of the reference frames in lower temporal layers is employed to accelerate the encoding of frames in higher temporal layers. The results show a reduction in the encoding time of around 21% compared to the medium preset. Specifically, for videos with high spatial and temporal complexities, which typically require longer encoding times, the proposed method achieves a better trade-off between bitrate savings and encoding time compared to the fast preset.
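To make the notion of lower and higher temporal layers concrete, the following Python sketch assigns a temporal layer to each frame. It assumes a dyadic hierarchical GOP of 32 frames, which yields layers 0 through 5; the GOP size and the helper names are assumptions for illustration and are not taken from the paper or from VVenC's configuration. The layer range 2 through 5 is the one stated in the TL;DR.

```python
GOP_SIZE = 32  # assumed dyadic GOP length; gives temporal layers 0..5


def temporal_layer(poc, gop_size=GOP_SIZE):
    """Temporal layer of a frame, given its picture order count (POC).

    In a dyadic hierarchy, frames at multiples of the GOP size sit in
    layer 0, the midpoint in layer 1, quarter points in layer 2, and so on.
    """
    max_layer = gop_size.bit_length() - 1  # log2(32) = 5
    pos = poc % gop_size
    if pos == 0:
        return 0
    trailing_zeros = (pos & -pos).bit_length() - 1
    return max_layer - trailing_zeros


def uses_reference_map(poc):
    """True if the frame falls in layers 2..5, where ETRF is applied."""
    return 2 <= temporal_layer(poc) <= 5
```

With this layout, frames at POC 0 and 16 (layers 0 and 1) would be encoded with the unmodified search and could supply partitioning maps, while a frame at POC 8 (layer 2) would be a candidate for early termination.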
