An Efficient Adaptive Compression Method for Human Perception and Machine Vision Tasks

Lei Liu, Zhenghao Chen, Zhihao Hu, Dong Xu

TL;DR

Neural image and video codecs are typically optimized for human viewing, which limits their usefulness for machine vision tasks. This work introduces Efficient Adaptive Compression (EAC), which combines an adaptive mechanism for partitioning latent features with lightweight task-specific adapters, so that a single codec can serve multiple machine vision tasks while preserving reconstruction quality for human viewers. EAC is designed to plug into existing NIC and NVC backbones and is reported to reduce bit-rate while improving performance on segmentation, detection, and related tasks across standard benchmarks. The paper supports its claims with ablations, complexity analyses, and cross-task evaluations, and positions EAC as a practical baseline for joint human-machine compression, with potential applications in autonomous systems and large-scale video analytics.

Abstract

While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: (1) an adaptive compression mechanism that adaptively selects several subsets of the latent features to balance the optimization for multiple machine vision tasks (e.g., segmentation and detection) and human vision; and (2) a task-specific adapter that uses a parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. With these two modules, we can optimize the bit-rate costs and improve machine vision performance. Our proposed EAC can be seamlessly integrated with existing NIC (i.e., Ballé2018 and Cheng2020) and NVC (i.e., DVC and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.
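
To make the adaptive selection step more concrete, the following is a minimal NumPy sketch of the partition, reconstruction, and aggregation operations as the Figure 1 caption describes them: a binary mask picks a subset of the quantized latent, the subset is sent as a 1D vector, and the receiver restores the 3D shape and fills unselected positions with a predicted mean. It is not the authors' implementation. The function names, the toy tensor sizes, the random mask, and the constant mean are all assumptions made for illustration; in the paper the mask and the mean are predicted by the network.

```python
import numpy as np

def partition(y_hat, mask):
    """Pick the masked elements of a 3D latent and flatten them to 1D."""
    return y_hat[mask]

def reconstruct(y_tilde, mask):
    """Scatter the 1D vector back into the 3D latent shape (zeros elsewhere)."""
    out = np.zeros(mask.shape, dtype=y_tilde.dtype)
    out[mask] = y_tilde  # same C-order traversal as the boolean indexing above
    return out

def aggregate(y_selected, mask, mu):
    """Fill every unselected position with its predicted mean value."""
    return np.where(mask, y_selected, mu)

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3                                # toy latent size (assumption)
y_hat = np.round(rng.normal(size=(C, H, W)) * 3) # stand-in for a quantized latent
mu = np.full((C, H, W), 0.5)                     # stand-in for the predicted mean
mask = rng.random((C, H, W)) < 0.4               # stand-in for one task's predicted mask

y_tilde = partition(y_hat, mask)                 # 1D subset that would be entropy coded
y_i = aggregate(reconstruct(y_tilde, mask), mask, mu)

print("elements sent:", y_tilde.size, "of", y_hat.size)
```

In this toy setting only the masked elements would need to be transmitted for the task, which is the intuition behind spending fewer bits on a machine vision branch than on the full latent used for human viewing.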

Paper Structure

This paper contains 19 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: (a) Overview of our adaptive compression module, which simultaneously balances the optimization for multiple machine vision tasks and human vision. The transmission module contains the arithmetic encoder and arithmetic decoder. (b) Details of the partitioning module, which selects subsets of the quantized latent feature $\hat{y}$ for the various vision tasks. (c) Details of the reconstruction and aggregation modules. The reconstruction module restores the selected latent feature from its 1D form $\tilde{y}_i$ to the 3D form $\hat{y}_i$ using a predicted binary mask. The aggregation module fills the unselected elements of the latent feature with their predicted mean values (i.e., $\mu$).
  • Figure 2: Details of how we implement the adapter alongside the task-specific network. We optimize it with a parameter-efficient delta-tuning strategy. For NIC, our adapter takes only spatial information (i.e., the reconstructed image) as input. For NVC, our adapter uses both spatial information (i.e., the current reconstructed frame) and temporal information (i.e., multiple previous reconstructed frames).
  • Figure 3: Overview of "EAC (NIC)", where we incorporate our efficient adaptive compression method into a neural image compression network.
  • Figure 4: (a) Overview of "EAC (NVC)", where we incorporate our efficient adaptive compression method into a neural video compression network. (b) Details of the $i$-th machine vision branch, $i\in\{1,2,\ldots,n\}$. Given an input frame $X_t$ at the current time step $t$, we first estimate the motion $M_t$ between $X_t$ and the reference frame $\hat{X}_{t-1}$. The motion is then compressed by an adaptive motion compression network to produce the reconstructed motions $\hat{M}_t^{mi}, i \in \{1,2,\ldots,n\}$ for machine vision (resp., $\hat{M}_t$ for human vision). Next, we use the reconstructed motion to perform motion compensation and predict the frame $\overline{X}_t^{mi}$ for machine vision (resp., $\overline{X}_t$ for human vision). We then compress the residual $R_t$, obtained by subtracting the predicted frame $\overline{X}_t$ from the current frame $X_t$, with an adaptive residual compression network to produce the reconstructed residual $\hat{R}_t^{mi}$ for machine vision (resp., $\hat{R}_t$ for human vision). Finally, the predicted frame $\overline{X}_t^{mi}$ (resp., $\overline{X}_t$) is added back to the reconstructed residual $\hat{R}_t^{mi}$ (resp., $\hat{R}_t$) to generate the final reconstructed frame $\hat{X}_t^{mi}$ for machine vision (resp., $\hat{X}_t$ for human vision).
  • Figure 5: Multi-task results (i.e., segmentation, detection, and human vision) for our "EAC (NIC)" compared with the baseline methods on the VOC2007 dataset. We report mIoU, mAP@0.5, and PSNR. For all codecs, we use PSANet as the segmentation network and Faster R-CNN as the detection network. "Ours (Ballé2018)" and "Ours (Cheng2020)" denote our "EAC (NIC)" framework with Ballé2018 and Cheng2020 as the image coding backbone, respectively. "codec+PSANet" and "codec+Faster R-CNN" denote that we directly use the codec to compress the images and then run the segmentation task and the detection task, respectively, on the compressed images.
  • ...and 4 more figures
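
The predictive coding loop in the Figure 4 caption (estimate motion, compensate, code the residual, add it back) can be walked through on a toy example. The sketch below covers only the human vision path and replaces every learned component with a simple stand-in, so it is an illustration under stated assumptions rather than the paper's networks: motion is a single integer translation found by exhaustive search, motion compensation is `np.roll`, and the residual codec is uniform rounding. The helper names `compensate`, `estimate_motion`, and `quantize` are hypothetical. According to the caption, each machine vision branch follows the same add-back structure with its own $\hat{M}_t^{mi}$ and $\hat{R}_t^{mi}$.

```python
import numpy as np

def compensate(ref, motion):
    """Toy motion compensation: shift the reference frame by (dy, dx)."""
    return np.roll(ref, shift=motion, axis=(0, 1))

def estimate_motion(cur, ref, search=2):
    """Toy motion estimation: exhaustive search over integer translations."""
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = np.mean((cur - compensate(ref, (dy, dx))) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def quantize(x, step=1.0):
    """Stand-in for the learned residual codec: uniform rounding."""
    return np.round(x / step) * step

rng = np.random.default_rng(0)
x_prev_hat = rng.integers(0, 256, size=(8, 8)).astype(np.float64)  # reference frame
# Current frame X_t: the reference shifted by (1, 2) plus small noise.
x_t = compensate(x_prev_hat, (1, 2)) + rng.normal(scale=2.0, size=(8, 8))

m_t = estimate_motion(x_t, x_prev_hat)   # motion M_t (coded losslessly in this toy)
x_bar = compensate(x_prev_hat, m_t)      # predicted frame
r_t = x_t - x_bar                        # residual R_t = X_t - predicted frame
r_hat = quantize(r_t)                    # reconstructed residual
x_hat = x_bar + r_hat                    # reconstructed frame

print("estimated motion:", m_t)
print("max reconstruction error:", np.abs(x_hat - x_t).max())
```

Because the residual is much smaller in magnitude than the frame itself once the prediction is good, it is cheaper to code, which is why the pipeline compresses $R_t$ rather than $X_t$ directly.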