Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework

Zifu Zhang, Shengxi Li, Xiancheng Sun, Mai Xu, Zhengyuan Liu, Jingyuan Xia

TL;DR

The paper addresses the challenge of universal compression for both human and machine vision by starting from machine-vision features rather than human-centric representations. It introduces Diff-FCHM, a diffusion-prior based framework with a variable-rate feature compression network (VFCN) that uses implicit variable normalization, and a diffusion-guided human-vision module (HVCN) built on a fusion control network (FCN) and auxiliary compression network (ACN) to restore perceptual details. Key contributions include an input-level normalization strategy enabling variable-rate machine-vision compression without retraining, autoregressive aggregation of machine semantics with diffusion priors, and end-to-end rate-distortion optimization that yields substantial bitrate savings while preserving machine-task accuracy and human perceptual quality. Experiments on COCO and Kodak demonstrate large BD-rate savings and superior performance across both machine-vision and human-vision tasks, underscoring the method’s practical impact for dual-purpose data transmission and storage.

Abstract

Human-machine collaborative compression has been receiving increasing research effort as a means of reducing image/video data while serving as the basis for both human perception and machine intelligence. Existing collaborative methods are predominantly built upon the de facto human-vision compression pipeline, which leads to deficiencies in complexity and bit-rate when machine-vision compression is aggregated onto it. Indeed, machine vision focuses solely on the core regions within the image/video and thus requires much less information than is compressed for human vision. In this paper, we therefore make the first successful attempt at a novel collaborative compression method based on machine-vision-oriented compression instead of the human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. We then propose to progressively aggregate the semantics from the machine-vision compression, whilst seamlessly tailing the diffusion prior to restore high-fidelity details for human vision; the method is hence named diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify that Diff-FCHM performs consistently better on both machine-vision and human-vision compression, by remarkable margins. Our code will be released upon acceptance.

Paper Structure

This paper contains 20 sections, 17 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Illustration of existing human-machine collaborative compression paradigms, compared with our framework. Enc and Dec denote the encoder and decoder, Head and Tail refer to the initial and final blocks of the machine-vision network, FCN represents the fusion control network, and $\bm{x}$, $\hat{\bm{x}}$, and $\bm{T}$ denote the original image, reconstructed image, and downstream task results, respectively. Please note that $\hat{\bm{x}}$ and $\bm{T}$ correspond to compression for human vision and machine vision, respectively.
  • Figure 2: The overall architecture of our Diff-FCHM method, which first obtains machine-vision features from the task head. The machine-vision features are then compressed via our variable-rate feature compression network (VFCN), through implicit variable normalisation (IVN) and de-normalisation (IVDN) layers, enabling variable downstream task performance through the remaining task tail networks. The compressed machine-vision features then operate as the basis for the human vision compression network (HVCN), with the goal of achieving high-fidelity compression for human vision. This is achieved by our newly proposed fusion control network (FCN) module and auxiliary compression network (ACN) module, which progressively aggregate the diffusion prior and the noisy latent with the semantics from machine-vision features.
  • Figure 3: Visualization of feature distributions, bit allocation, object detection and instance segmentation results under four scaling factors. The first row illustrates the distribution of the scaled input $\bar{P}_2$ features, while the second row shows the distribution of the latent $\bm{y}_p$. The third row presents the bit allocation maps, computed by averaging the negative log-likelihood across channels. The final row displays the object detection and instance segmentation results corresponding to each scaling factor.
  • Figure 4: Rate-mAP curves of our method and five baseline methods for detection and segmentation tasks.
  • Figure 5: Subjective results for machine vision tasks on the COCO dataset. Note that wrongly detected objects are annotated by the symbol $\star$. For correctly detected objects, a higher confidence score indicates a more reliable detection, whereas for incorrectly detected instances, a lower score reflects better suppression of false positives.
  • ...and 7 more figures
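
The Figure 3 caption describes bit allocation maps as the negative log-likelihood averaged across channels. A minimal sketch of that computation is given below, under stated assumptions: the per-element likelihoods are taken to come from some learned entropy model and are indexed as `[channel][row][column]`, and the logarithm is base 2 so that the map is expressed in bits. The function name `bit_allocation_map`, the clamping constant `eps`, and the toy values are hypothetical and are not taken from the paper.

```python
import math


def bit_allocation_map(likelihoods, eps=1e-9):
    """Average the negative log2-likelihood across channels.

    likelihoods: nested list indexed as [channel][row][column], holding the
        probability an entropy model assigns to each latent element
        (assumed layout, not specified by the paper).
    Returns a [row][column] map of mean bits per channel at each position.
    """
    num_channels = len(likelihoods)
    height = len(likelihoods[0])
    width = len(likelihoods[0][0])
    bit_map = [[0.0] * width for _ in range(height)]
    for channel in likelihoods:
        for i in range(height):
            for j in range(width):
                # Clamp to (eps, 1] so the logarithm stays finite.
                p = min(max(channel[i][j], eps), 1.0)
                bit_map[i][j] += -math.log2(p) / num_channels
    return bit_map


# Toy likelihoods: 2 channels over a 2x2 spatial grid (illustrative values).
toy_likelihoods = [
    [[0.5, 0.25], [1.0, 0.5]],
    [[0.5, 0.25], [1.0, 0.125]],
]
toy_map = bit_allocation_map(toy_likelihoods)
print(toy_map)  # [[1.0, 2.0], [0.0, 2.0]]
```

Positions where the entropy model is confident (likelihood near 1) cost close to zero bits, whereas low-likelihood positions receive more bits, which is what such a map visualises.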