A Preprocessing Framework for Video Machine Vision under Compression

Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang, Xiaodong Xie

TL;DR

The proposed video preprocessing framework incorporates a neural preprocessor that retains information crucial for downstream tasks, improving rate-accuracy performance, and introduces a differentiable virtual codec that provides rate and distortion constraints during training.

Abstract

There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor that retains information crucial for downstream tasks, improving rate-accuracy performance. We further introduce a differentiable virtual codec that provides rate and distortion constraints during training. At test time, we apply widely used standard codecs directly, so our solution can be easily deployed in real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach reduces bitrate by over 15% compared with the anchor that uses only the standard codec.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The overall pipeline of the proposed framework. During the training phase, we employ the virtual codec to generate the necessary distortion loss and rate loss for supervision, while accuracy loss is generated by the vision task. In the testing phase, we encode the videos processed by the preprocessor using a real standard codec to obtain the bitrate. We also collect the corresponding accuracy metrics from the downstream vision task to assess the overall performance.
  • Figure 2: The structure of the proposed preprocessor. It consists of two primary branches that extract temporal and spatial features, respectively, from the input video; the features are subsequently merged.
  • Figure 3: The proposed virtual codec emulates the fundamental logic used in video encoding and implements it through tensor operations. During the training phase, the virtual codec is responsible for providing rate loss and distortion loss.
  • Figure 4: Rate-accuracy plots of the test results for video action recognition (a) and video object tracking (b).
  • Figure 5: Visualization examples of the proposed method's outputs. (a) shows sample results for video object tracking, with frames from sequence Test000004 in the GOT-10k dataset. (b) shows sample results for video action recognition, with frames from the test sequence 0u4c8Cel91U in the Kinetics400 dataset.
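
The Figure 1 caption describes three training signals: a rate loss and a distortion loss from the virtual codec, and an accuracy loss from the vision task. The sketch below shows one plausible way such terms might be combined into a single training objective. The weighted-sum form, the `LossWeights` class, the `total_training_loss` function, and all weight values are assumptions made for illustration; this page does not give the paper's actual loss formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical trade-off weights; the values are illustrative only."""
    rate: float = 1.0
    distortion: float = 1.0
    accuracy: float = 1.0


def total_training_loss(
    rate_loss: float,
    distortion_loss: float,
    accuracy_loss: float,
    weights: LossWeights = LossWeights(),
) -> float:
    """Combine the three training signals as a weighted sum (an assumption).

    rate_loss and distortion_loss stand in for the virtual codec's outputs;
    accuracy_loss stands in for the downstream vision task's loss.
    """
    return (
        weights.rate * rate_loss
        + weights.distortion * distortion_loss
        + weights.accuracy * accuracy_loss
    )


if __name__ == "__main__":
    # Placeholder numbers, not results from the paper.
    w = LossWeights(rate=2.0, distortion=1.0, accuracy=0.5)
    print(total_training_loss(0.5, 0.2, 1.0, w))
```

In an actual training loop the three terms would be tensors from a differentiable pipeline so that gradients reach the preprocessor; plain floats are used here only to keep the sketch self-contained.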