The Practice of Averaging Rate-Distortion Curves over Testsets to Compare Learned Video Codecs Can Cause Misleading Conclusions

M. Akin Yilmaz; Onur Keleş; A. Murat Tekalp

TL;DR

The paper shows that averaging RD curves across test videos can bias comparisons of learned video codecs. It provides an analytical counterexample showing that averaging can mislead unless the specific condition $\Delta P_2 \cdot \Delta B_1 = \Delta P_1 \cdot \Delta B_2$ holds, and it validates the concern with UVG experiments comparing Li_CVPR2023 and Yilmaz_ICIP2024, where per-sequence BD-rates favor one codec but the averaged RD curve favors the other. The authors show that differing operating ranges and sequence characteristics can cause the average to misrepresent true performance, especially when one video dominates the average. They advocate reporting per-sequence RD curves and computing the average BD-rate as the mean of per-sequence BD-rates, aligning learned-codec evaluation with HEVC/VVC practice to ensure fair comparisons and reproducible conclusions.

Abstract

This paper demonstrates how the prevalent practice in the learned video compression community of averaging rate-distortion (RD) curves across a test video set can lead to misleading conclusions when evaluating codec performance. Through analysis of a simple case and experimental results with two recent learned video codecs, we show how averaged RD curves can mislead comparative evaluation of different codecs, particularly when videos in a dataset have varying characteristics and operating ranges. We illustrate how a single video whose RD characteristics differ from the rest of the test set can disproportionately influence the average RD curve, potentially overshadowing a codec's superior performance across most individual sequences. Using two recent learned video codecs on the UVG dataset as a case study, we demonstrate that computing performance metrics, such as the BD-rate, from the average RD curve suggests conclusions that contradict those reached by averaging per-sequence metrics. Hence, we argue that the learned video compression community should also report per-sequence RD curves and that performance metrics for a test set should be computed as the average of per-sequence metrics, similar to the established practice in traditional video coding, to ensure fair and accurate codec comparisons.
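The two aggregation strategies contrasted in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy RD model (rate doubling every 4 dB), the sequence names `easy` and `hard`, and the point-wise averaging of the i-th RD point across sequences are all assumptions made for this example. The BD-rate follows the usual Bjøntegaard recipe (polynomial fit of log-rate versus PSNR, integrated over the common PSNR range).

```python
import numpy as np

def bd_rate(anchor, test):
    """BD-rate (%) of `test` relative to `anchor`; each is (rates, psnrs).
    Negative values mean `test` needs fewer bits at equal PSNR."""
    ra, pa, rt, pt = (np.asarray(v, dtype=float) for v in (*anchor, *test))
    lo, hi = max(pa.min(), pt.min()), min(pa.max(), pt.max())
    if hi <= lo:
        raise ValueError("PSNR ranges do not overlap")
    deg = min(3, len(pa) - 1, len(pt) - 1)
    # Fit log-rate as a polynomial in (PSNR - lo); the shift improves conditioning.
    fit_a = np.polyfit(pa - lo, np.log(ra), deg)
    fit_t = np.polyfit(pt - lo, np.log(rt), deg)
    # Integrate both fits over the common PSNR interval [0, hi - lo].
    int_a = np.polyval(np.polyint(fit_a), hi - lo)
    int_t = np.polyval(np.polyint(fit_t), hi - lo)
    mean_log_diff = (int_t - int_a) / (hi - lo)
    return float((np.exp(mean_log_diff) - 1.0) * 100.0)

def mean_per_sequence_bd_rate(anchor, test):
    """Recommended aggregation: average the per-sequence BD-rates."""
    return float(np.mean([bd_rate(anchor[k], test[k]) for k in anchor]))

def bd_rate_of_averaged_curve(anchor, test):
    """Criticized aggregation: average the RD points first, then compute BD-rate."""
    def average(curves):
        rates = np.mean([c[0] for c in curves.values()], axis=0)
        psnrs = np.mean([c[1] for c in curves.values()], axis=0)
        return rates, psnrs
    return bd_rate(average(anchor), average(test))

def toy_rates(scale, base_psnr, psnrs):
    # Hypothetical RD model: the bitrate doubles for every 4 dB of PSNR.
    return [scale * 2 ** ((p - base_psnr) / 4) for p in psnrs]

# The test codec needs 20% fewer bits than the anchor at equal PSNR on BOTH
# sequences, but on the high-bitrate sequence it operates at a higher PSNR range.
anchor = {
    "easy": (toy_rates(100, 36, [36, 40]), [36, 40]),
    "hard": (toy_rates(1000, 30, [30, 34]), [30, 34]),
}
test = {
    "easy": (toy_rates(80, 36, [36, 40]), [36, 40]),
    "hard": (toy_rates(800, 30, [33.5, 37.5]), [33.5, 37.5]),
}

per_seq = mean_per_sequence_bd_rate(anchor, test)
averaged = bd_rate_of_averaged_curve(anchor, test)
print(f"mean of per-sequence BD-rates: {per_seq:.2f}%")
print(f"BD-rate of averaged RD curve:  {averaged:.2f}%")
```

In this toy setup the mean of per-sequence BD-rates is about -20%, while the BD-rate computed from the averaged curve comes out slightly positive (roughly +3.9%), so the averaged curve ranks the test codec as worse even though it is better on every sequence. The sign flip comes entirely from the high-bitrate sequence dominating the averaged rates while the operating ranges differ.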

Paper Structure (8 sections, 6 equations, 4 figures, 1 table)

This paper contains 8 sections, 6 equations, 4 figures, 1 table.

Introduction
Related work and Contributions
Why Averaging RD Curves Misleads Codec Assessment: Analysis for a Simple Case
Experimental Evidence
Experimental Setup
Analysis of RD Curve Averaging Effects
Impact of Individual Sequences
Conclusion

Figures (4)

Figure 1: Illustration of averaging RD curves in the case of two hypothetical codecs with linear RD curves, where the bitrate ranges of the two codecs for encoding video-2 do not fully overlap.
Figure 2: Rate-distortion curves for the seven (7) videos in the UVG dataset. The last graph is the average RD curve. Observe that ReadySetGo is the only video where the RD curve for Li_CVPR2023 is on top of that of Yilmaz_ICIP2024.
Figure 3: Average RD curve for the UVG dataset if ReadySetGo is excluded. The average RD curve for Li_CVPR2023 is on top of that of Yilmaz_ICIP2024, although we removed the only video where the curve for Li_CVPR2023 is on top of that of Yilmaz_ICIP2024.
Figure 4: Average RD curve for the UVG dataset if Beauty is excluded. Excluding Beauty, where Yilmaz_ICIP2024 is superior, puts the RD curve for Yilmaz_ICIP2024 over that of Li_CVPR2023, which is inconsistent with other average RD curves.
