Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines
Honglei Zhang, Jukka I. Ahonen, Nam Le, Ruiying Yang, Francesco Cricri
TL;DR
This work addresses the content-dependent artifacts of human-oriented codecs by proposing a competitive-learning framework that jointly trains multiple content-specific post-processing filters for machine vision tasks. It replaces hard assignment with a weighted loss and uses a softmax-based, temperature-controlled annealing scheme to allocate samples to filters dynamically; the filters use an autoencoder-based architecture conditioned on quality indicators. On the OpenImages dataset, the jointly trained, content-aware filters achieve BD-rate reductions on object detection and instance segmentation, with block-wise processing (notably 128×128 blocks) delivering the best gains (-42.3% for detection, -44.7% for segmentation). The results underscore the value of optimizing filters for both content and reconstruction quality and point to practical gains for video coding for machines, with future work extending to conventional codecs.
Abstract
This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human-oriented video/image codec into a codec suitable for machine vision tasks. Observing that the artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filters dynamically and in a fuzzy manner, so that the winning filter is further optimized on the given sample. Inspired by simulated annealing optimization techniques, we employ a softmax function with a temperature variable as the weight allocation function to mitigate the effects of random initialization. Our evaluation, conducted on a system that uses multiple post-processing filters within a Versatile Video Coding (VVC) codec framework, demonstrates the superiority of content-specific filters trained with our proposed strategies, specifically when images are processed in blocks. Using the VVC reference software VTM 12.0 as the anchor, experiments on the OpenImages dataset show that the BD-rate reduction improves from -41.3% and -44.6% to -42.3% and -44.7% for object detection and instance segmentation tasks, respectively, compared to independently trained filters. The statistics of the filter usage align with our hypothesis and underscore the importance of jointly optimizing filters for both content and reconstruction quality. Our findings pave the way for further improving the performance of video/image codecs.
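The fuzzy, temperature-controlled sample assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each filter's weight for a sample is a softmax over the negated per-filter losses divided by a temperature, and that the temperature decays exponentially during training. The function names, the schedule, and the values `t_start` and `t_end` are hypothetical.

```python
import math

def softmax_weights(losses, temperature):
    """Soft assignment of one sample to the filters: lower loss -> larger weight.

    Assumption: weights are softmax(-loss / temperature).
    """
    # Subtract the minimum loss before exponentiating for numerical stability.
    m = min(losses)
    exps = [math.exp(-(l - m) / temperature) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(losses, temperature):
    """Weighted sum of per-filter losses, replacing a hard winner-takes-all pick."""
    weights = softmax_weights(losses, temperature)
    return sum(w * l for w, l in zip(weights, losses))

def annealed_temperature(step, total_steps, t_start=10.0, t_end=0.01):
    """Hypothetical exponential decay from t_start to t_end over training."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac

if __name__ == "__main__":
    # Made-up per-filter losses for a single training sample.
    losses = [0.9, 0.5, 0.7]
    for step in (0, 50, 99):
        t = annealed_temperature(step, 100)
        w = softmax_weights(losses, t)
        print(f"step={step:3d}  T={t:.3f}  weights={[round(x, 3) for x in w]}")
```

At a high temperature the weights are close to uniform, so every filter learns from every sample and a poor random initialization does not lock a filter out. As the temperature falls, the weights approach a hard assignment to the filter with the lowest loss.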
