Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework
Zifu Zhang, Shengxi Li, Xiancheng Sun, Mai Xu, Zhengyuan Liu, Jingyuan Xia
TL;DR
The paper addresses universal compression for both human and machine vision by starting from machine-vision features rather than human-centric representations. It introduces Diff-FCHM, a diffusion-prior-based framework with two parts: a variable-rate feature compression network (VFCN) that uses implicit variable normalization, and a diffusion-guided human-vision module (HVCN), built on a fusion control network (FCN) and an auxiliary compression network (ACN), that restores perceptual details. Key contributions include an input-level normalization strategy that enables variable-rate machine-vision compression without retraining, autoregressive aggregation of machine semantics with diffusion priors, and end-to-end rate-distortion optimization that yields substantial bitrate savings while preserving machine-task accuracy and human perceptual quality. Experiments on COCO and Kodak show large BD-rate savings and superior performance on both machine-vision and human-vision tasks, underscoring the method's practical value for dual-purpose data transmission and storage.
Abstract
Human-machine collaborative compression has attracted increasing research effort as a way to reduce image/video data that serves both human perception and machine intelligence. Existing collaborative methods are predominantly built upon the de facto human-vision compression pipeline, which suffers in both complexity and bit-rate when machine-vision compression is incorporated. Machine vision, however, focuses solely on the core regions of an image/video and therefore requires much less information than human vision. In this paper, we thus present the first successful attempt at a collaborative compression method built on machine-vision-oriented compression instead of the human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. We also develop a plug-and-play variable bit-rate strategy for machine-vision tasks. We then propose to progressively aggregate the semantics from the machine-vision compression while seamlessly integrating the diffusion prior to restore high-fidelity details for human vision; the resulting method is named diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify that Diff-FCHM delivers consistently superior performance, by remarkable margins, on both machine-vision and human-vision compression. Our code will be released upon acceptance.
