The Practice of Averaging Rate-Distortion Curves over Testsets to Compare Learned Video Codecs Can Cause Misleading Conclusions
M. Akin Yilmaz, Onur Keleş, A. Murat Tekalp
TL;DR
The paper shows that averaging RD curves across test videos can bias comparisons of learned video codecs. It gives an analytical counterexample showing that averaging can mislead unless the condition $\Delta P_2 \cdot \Delta B_1 = \Delta P_1 \cdot \Delta B_2$ holds, and it validates the concern with UVG experiments comparing Li_CVPR2023 and Yilmaz_ICIP2024, where per-sequence BD-rates favor one codec but the averaged RD curve favors the other. The authors show that differing operating ranges and sequence characteristics can cause the average to misrepresent true performance, especially when one video dominates the average. They advocate reporting per-sequence RD curves and computing the average BD-rate as the mean of per-sequence BD-rates, which aligns learned-codec evaluation with HEVC/VVC practice and supports fair comparisons and reproducible conclusions.
Abstract
This paper aims to demonstrate how the prevalent practice in the learned video compression community of averaging rate-distortion (RD) curves across a test video set can lead to misleading conclusions when evaluating codec performance. Through analysis of a simple case and experimental results with two recent learned video codecs, we show how averaged RD curves can mislead comparative evaluation of different codecs, particularly when the videos in a dataset have varying characteristics and operating ranges. We illustrate how a single video whose RD characteristics differ from the rest of the test set can disproportionately influence the average RD curve, potentially overshadowing a codec's superior performance on most individual sequences. Using two recent learned video codecs on the UVG dataset as a case study, we demonstrate that computing performance metrics, such as the BD-rate, from the average RD curve suggests conclusions that contradict those reached by averaging per-sequence metrics. Hence, we argue that the learned video compression community should also report per-sequence RD curves, and that performance metrics for a test set should be computed as the average of per-sequence metrics, similar to the established practice in traditional video coding, to ensure fair and accurate codec comparisons.
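The effect described above can be reproduced with a small synthetic sketch. Everything in it is an assumption made for illustration and none of it comes from the paper's experiments: the three sequence names, their bitrates and PSNR values, and the per-sequence rate factors are invented. The `bd_rate` helper uses the commonly used cubic-fit formulation of the Bjøntegaard delta rate, which may differ in detail from the implementation the authors used. The test codec saves 10% bitrate on two low-bitrate sequences and costs 5% more on one high-bitrate sequence.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """BD-rate in percent (negative means the test codec saves bitrate).

    Common formulation: fit a cubic to log-rate as a function of PSNR for
    each codec, then average the gap over the overlapping PSNR interval.
    """
    pa, pt = np.asarray(psnr_anchor, float), np.asarray(psnr_test, float)
    fit_a = np.polyfit(pa, np.log(np.asarray(rate_anchor, float)), 3)
    fit_t = np.polyfit(pt, np.log(np.asarray(rate_test, float)), 3)
    lo, hi = max(pa.min(), pt.min()), min(pa.max(), pt.max())
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    return (np.exp((int_t - int_a) / (hi - lo)) - 1.0) * 100.0

# Hypothetical data: four operating points per sequence.
rel_rates = np.array([1.0, 2.0, 4.0, 8.0])
sequences = {
    # name: (bitrate scale, anchor PSNR points, test-codec rate factor)
    "seq_low_1": (1.0, [32.0, 35.0, 37.5, 39.5], 0.90),
    "seq_low_2": (1.0, [33.0, 36.0, 38.5, 40.5], 0.90),
    "seq_high": (100.0, [30.0, 33.0, 35.5, 37.5], 1.05),  # dominates the average
}

per_seq_bd = {}
anchor_rates, test_rates, psnrs = [], [], []
for name, (scale, psnr, factor) in sequences.items():
    r_anchor = scale * rel_rates
    r_test = factor * r_anchor  # same PSNR reached at a scaled bitrate
    per_seq_bd[name] = bd_rate(r_anchor, psnr, r_test, psnr)
    anchor_rates.append(r_anchor)
    test_rates.append(r_test)
    psnrs.append(psnr)

# Approach 1: mean of per-sequence BD-rates.
mean_per_seq = float(np.mean(list(per_seq_bd.values())))

# Approach 2: average the RD points across sequences, then one BD-rate.
avg_psnr = np.mean(psnrs, axis=0)
bd_from_avg_curve = float(
    bd_rate(np.mean(anchor_rates, axis=0), avg_psnr,
            np.mean(test_rates, axis=0), avg_psnr)
)

print("per-sequence BD-rates (%):", per_seq_bd)
print("mean of per-sequence BD-rates (%):", mean_per_seq)
print("BD-rate of the averaged RD curve (%):", bd_from_avg_curve)
```

With these made-up numbers the mean of the per-sequence BD-rates is about -5%, which favors the test codec, while the BD-rate computed from the averaged RD curve is about +4.7%, which favors the anchor. The sign flips because the averaged bitrates are dominated by the single high-bitrate sequence.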
