Task-Aware Encoder Control for Deep Video Compression
Xingtong Ge, Jixiang Luo, Xinjie Zhang, Tongda Xu, Guo Lu, Dailan He, Jing Geng, Yan Wang, Jun Zhang, Hongwei Qin
TL;DR
This work addresses the inefficiency of applying a single deep video compression (DVC) codec to multiple machine vision tasks by introducing encoder-side control that stays compatible with a fixed, pre-trained decoder. The approach combines Dynamic Vision Mode Prediction (DVMP) with a GoP structure predictor (GoP Selection) to produce machine-oriented $P_m$ frames and adaptive GoP configurations, reducing bitrate in a task-aware way while keeping the video suitable for human viewing when needed. Key contributions are a task-aware encoding framework compatible with existing decoders, a hyperprior-based DVMP that uses Gumbel Softmax for per-element masking, and a GoP adaptation mechanism that dynamically balances bitrate against downstream task performance. The method is evaluated on MOT, VOD, and VAR benchmarks, where it yields substantial bitrate savings. In practice, it reduces bandwidth and deployment complexity for cloud-based machine vision pipelines that use learned codecs, without retraining or altering the decoder.
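To make the per-element masking idea concrete, the following is a minimal sketch of Gumbel Softmax sampling for a binary keep/skip decision on each latent element. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sample_gumbel`, `gumbel_softmax_mask`, `apply_mask`), the two-logit keep/skip parameterization, and the temperature `tau` are hypothetical, and in a real system the logits would come from a learned predictor and the hard decision would typically be paired with a straight-through gradient estimator during training.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 to avoid log(0)
    return -math.log(-math.log(u))


def gumbel_softmax_mask(logits, tau=1.0, hard=True, rng=None):
    """Sample a per-element keep mask from (keep_logit, skip_logit) pairs.

    Assumption: each element has two hypothetical logits. With hard=True the
    result is a 0/1 mask; with hard=False it is the relaxed keep probability.
    """
    rng = rng or random.Random()
    mask = []
    for keep_logit, skip_logit in logits:
        # Perturb both logits with Gumbel noise and scale by the temperature.
        a = (keep_logit + sample_gumbel(rng)) / tau
        b = (skip_logit + sample_gumbel(rng)) / tau
        # Numerically stable two-way softmax.
        m = max(a, b)
        ea, eb = math.exp(a - m), math.exp(b - m)
        p_keep = ea / (ea + eb)
        mask.append(float(p_keep >= 0.5) if hard else p_keep)
    return mask


def apply_mask(latent, mask):
    """Zero out (skip) the latent elements whose mask value is 0."""
    return [y * m for y, m in zip(latent, mask)]


# Usage: strongly biased logits give an effectively deterministic mask.
rng = random.Random(0)
mask = gumbel_softmax_mask([(50.0, -50.0), (-50.0, 50.0)], rng=rng)
masked_latent = apply_mask([2.0, 3.0], mask)
```

Lowering `tau` pushes the relaxed probabilities toward 0 or 1, while a higher `tau` keeps them closer to uniform; this sketch only shows the sampling step and leaves out how the masked elements would be handled by the entropy coder.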
Abstract
Prior research on deep video compression (DVC) for machine tasks typically requires training a separate codec for each task, and therefore a dedicated decoder per task. In contrast, traditional video codecs employ a flexible encoder controller, which allows a single codec to be adapted to different tasks through mechanisms such as mode prediction. Drawing inspiration from this, we introduce an encoder controller for deep video compression for machines. The controller comprises a mode prediction module and a Group of Pictures (GoP) selection module. Our approach centralizes control at the encoding stage, allowing the encoder to be adjusted for different tasks, such as detection and tracking, while remaining compatible with a standard pre-trained DVC decoder. Empirical evidence demonstrates that our method is applicable across multiple tasks with various existing pre-trained DVCs. Moreover, extensive experiments show that our method reduces bitrate by about 25% compared with previous DVC methods across different tasks, using only one pre-trained decoder.
