Task-Aware Encoder Control for Deep Video Compression

Xingtong Ge; Jixiang Luo; Xinjie Zhang; Tongda Xu; Guo Lu; Dailan He; Jing Geng; Yan Wang; Jun Zhang; Hongwei Qin

TL;DR

This work tackles the inefficiency of training a separate DVC codec for each machine vision task by introducing encoder-side control that preserves compatibility with a fixed pre-trained decoder. The approach combines Dynamic Vision Mode Prediction (DVMP) with a GoP structure predictor (GoP Selection) to create machine-focused $P_m$ frames and adaptive GoP configurations, enabling task-aware bitrate reductions while keeping the video viewable by humans when needed. Key contributions include a task-aware encoding framework compatible with existing decoders, a hyperprior-based DVMP that uses Gumbel Softmax for per-element masking, and a GoP adaptation mechanism that dynamically balances bitrate against downstream task performance. The method is demonstrated on multi-object tracking (MOT), video object detection (VOD), and video action recognition (VAR) benchmarks with substantial bitrate savings. In practice, it reduces bandwidth and deployment complexity for cloud-based machine vision pipelines that use learned codecs, without retraining or altering decoders.
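To make the per-element masking idea concrete, here is a minimal sketch of a two-class Gumbel Softmax sampler that decides, for each latent element, whether to keep or skip it. It is an illustration under stated assumptions, not the paper's DVMP network: the function names (`gumbel_softmax_mask`, `apply_mask`), the `(skip_logit, keep_logit)` layout, and the temperature `tau` are hypothetical, and the hyperprior that would produce the logits is not modeled.

```python
import math
import random


def gumbel_noise(rng):
    """Sample standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_mask(logits, tau=1.0, hard=True, rng=None):
    """Sample a keep/skip mask with the two-class Gumbel Softmax trick.

    logits: list of (skip_logit, keep_logit) pairs, one per latent element
            (hypothetical layout; in the paper the mode decision is guided
            by the hyperprior, which is not modeled here).
    tau:    softmax temperature; lower values push samples toward 0 or 1.
    hard:   if True, return 0.0/1.0 decisions; otherwise soft keep
            probabilities in [0, 1].
    """
    rng = rng or random.Random()
    mask = []
    for skip_logit, keep_logit in logits:
        a = (skip_logit + gumbel_noise(rng)) / tau
        b = (keep_logit + gumbel_noise(rng)) / tau
        m = max(a, b)  # subtract the max for numerical stability
        exp_a, exp_b = math.exp(a - m), math.exp(b - m)
        p_keep = exp_b / (exp_a + exp_b)
        if hard:
            mask.append(1.0 if p_keep >= 0.5 else 0.0)
        else:
            mask.append(p_keep)
    return mask


def apply_mask(latent, mask):
    """Zero out skipped elements so they carry no task-irrelevant detail."""
    return [value * keep for value, keep in zip(latent, mask)]
```

During training, a framework with automatic differentiation would typically use the soft probabilities (or a straight-through estimator) so gradients can reach the logits; this plain-Python version only shows the sampling step.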

Abstract

Prior research on deep video compression (DVC) for machine tasks typically requires training a unique codec for each specific task, which in turn requires a dedicated decoder per task. In contrast, traditional video codecs employ a flexible encoder controller, enabling a single codec to be adapted to different tasks through mechanisms such as mode prediction. Drawing inspiration from this, we introduce an encoder controller for deep video compression for machines. The controller comprises a mode prediction module and a Group of Pictures (GoP) selection module. Our approach centralizes control at the encoding stage, allowing the encoder to be adjusted for different tasks, such as detection and tracking, while maintaining compatibility with a standard pre-trained DVC decoder. Empirical evidence demonstrates that our method is applicable across multiple tasks with various existing pre-trained DVCs. Moreover, extensive experiments show that our method saves about 25% bitrate compared with previous DVC across different tasks, using only one pre-trained decoder.

Paper Structure (16 sections, 3 equations, 9 figures, 2 tables)

Introduction
Related Works
Video Compression
Compression for Machine Vision
Compressed Video Analysis/Understanding
Method
Overview
Dynamic Vision Mode Prediction
GoP Selection
Experiments
Multi-Object Tracking
Video Object Detection
Video Action Recognition
Video Reconstruction Quality
Ablation Study
...and 1 more section

Figures (9)

Figure 1: (a) Mainstream video codec that serves human viewing. (b) Our controlled video codec for machine vision with a fixed decoder. (c) Other video codecs for machine vision with one-to-one encoders and decoders.
Figure 2: (a) Overview of our "Controlling DVC for Machine" framework. Given an input GoP, we first use the GoP Selection network to predict the GoP structure; the predicted structure then controls the encoding procedure to encode the input frames for machine vision tasks. (b) The "0" element controls the encoder to use DVMP. (c) The GoP Selection network, including the pre-analysis stage and the GoP prediction stage.
Figure 3: Hyper-prior guided Dynamic Vision Mode Prediction network.
Figure 4: Different coding GoP structures. The original structure consists of I and P frames. In the middle, the hand-crafted structure for machine vision consists of I, P and $P_m$ frames arranged alternately. In the predicted (GoP selected) structure, the frame types are predicted by the GoP selection network, targeting a better trade-off between bitrate and machine vision performance.
Figure 5: Left: Using DFS to search for the near-optimal GoP structure for Bpp-mAP. Right: Results of simply fine-tuning FVC
...and 4 more figures
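
The GoP structures described in Figures 2 and 4 can be pictured as a binary control vector over the frames that follow the I frame. The sketch below is only an illustration under stated assumptions: the helper names (`gop_frame_types`, `alternating_gop`) are hypothetical, the mapping of "1" to an ordinary P frame is inferred from the caption stating that "0" selects DVMP, and the alternating pattern is assumed to start with $P_m$.

```python
def gop_frame_types(control_bits):
    """Map a binary control vector to frame types for one GoP.

    Assumptions (not confirmed by the source): the first frame is always
    an I frame, a 0 selects a machine-focused P_m frame coded with DVMP,
    and a 1 selects an ordinary P frame.
    """
    types = ["I"]
    for bit in control_bits:
        if bit not in (0, 1):
            raise ValueError(f"control bits must be 0 or 1, got {bit!r}")
        types.append("P_m" if bit == 0 else "P")
    return types


def alternating_gop(length):
    """Hand-crafted structure: I followed by alternating P_m and P frames.

    The starting frame type of the alternation is an assumption.
    """
    if length < 1:
        raise ValueError("a GoP needs at least one frame")
    control_bits = [i % 2 for i in range(length - 1)]  # 0, 1, 0, 1, ...
    return gop_frame_types(control_bits)


if __name__ == "__main__":
    # A predicted structure would come from the GoP Selection network;
    # here the control vector is written by hand for illustration.
    print(gop_frame_types([0, 0, 1, 0]))
    print(alternating_gop(5))
```

A learned predictor would replace the hand-written control vector with one chosen per input GoP, which is what lets the encoder trade bitrate against task performance without touching the decoder.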
