Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Sha Guo; Zhuo Chen; Yang Zhao; Ning Zhang; Xiaotong Li; Lingyu Duan

TL;DR

This paper addresses the need for scalable image compression that preserves texture for human perception and semantic content for machine vision. It proposes a content-adaptive diffusion-based framework that blends texture-semantic pseudo-label extraction with a Markov palette diffusion model to encode latent features in a scalable, rate-perception controlled manner. Key contributions include (i) a Markov palette diffusion method with hierarchical clustering for latent-space compression, (ii) a texture-semantic representation learned via contrastive learning on pseudo-labels, and (iii) a reversible forward-backward diffusion mechanism enabling flexible operating points without retraining. Experiments on COCO and FFHQ show superior perceptual quality at low bitrates while preserving downstream machine vision performance across object detection, segmentation, and facial landmark tasks.

Abstract

Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.
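As a rough illustration of how picking an intermediate state can trade bitrate against fidelity without retraining, the toy sketch below builds a chain of progressively coarser palettes over a 1-D stand-in for latent features. It is not the paper's Markov palette diffusion model: the function names (`build_palette_chain`, `bits_per_element`), the quantile-based initial palette, the nearest-pair merging rule, and the Gaussian test data are all assumptions made only for this example.

```python
import math
import numpy as np

def build_palette_chain(values, start_size):
    """Return a list of (palette, labels) states, ordered from fine to coarse.

    Each step merges the two closest palette entries, so the next state
    depends only on the current one (a simple chain, assumed for illustration).
    """
    # Assumed initialisation: evenly spaced quantiles of the data, sorted and de-duplicated.
    palette = np.unique(np.quantile(values, np.linspace(0.0, 1.0, start_size)))
    states = []
    while True:
        # Assign every value to its nearest palette entry.
        labels = np.abs(values[:, None] - palette[None, :]).argmin(axis=1)
        states.append((palette.copy(), labels))
        if len(palette) == 1:
            break
        # The palette is sorted, so the closest pair is always adjacent.
        i = int(np.diff(palette).argmin())
        merged = 0.5 * (palette[i] + palette[i + 1])
        palette = np.concatenate([palette[:i], [merged], palette[i + 2:]])
    return states

def bits_per_element(palette_size):
    """Fixed-length index cost for a palette of the given size."""
    return math.ceil(math.log2(palette_size)) if palette_size > 1 else 0

def mse_at(values, state):
    """Mean squared error after replacing each value by its palette entry."""
    palette, labels = state
    return float(np.mean((values - palette[labels]) ** 2))

rng = np.random.default_rng(0)
latent = rng.normal(size=1024)          # stand-in for latent features
states = build_palette_chain(latent, start_size=16)

# Pick any intermediate state as the operating point; nothing is retrained.
for t in (0, 8, 14):
    palette, _ = states[t]
    print(t, len(palette), bits_per_element(len(palette)), mse_at(latent, states[t]))
```

Each index `t` into `states` plays the role of one operating point: later states use fewer palette entries and therefore fewer bits per element, at the cost of a coarser reconstruction.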

Paper Structure (26 sections, 13 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Proposed Method
Texture-Semantic Pseudo-Label Extraction
Extraction of Texture-Semantic Representation
Perceptual distance measurement
Pseudo-label Generation and Contrastive Learning
Diffusion-Based Image Feature Compression
Forward Process of Diffusion
Reverse Process of Diffusion
Training Objectives
Experiments
Datasets and Settings
Evaluation for Human Vision
Evaluation for Machine Vision
...and 11 more sections

Figures (8)

Figure 1: Our feature compression-transmission-decode-analysis paradigm: Features are extracted and compressed at front-end devices according to user-defined compression rates, with decompression and vision tasks carried out at the server side.
Figure 2: The VGG [simonyan2014very] decomposition of the "zebra" image: (a) Original image. (b)-(f) Feature maps of $conv1^{(2)}$, $conv2^{(2)}$, $conv3^{(3)}$, $conv4^{(3)}$, $conv5^{(3)}$, each shown with its Fast Fourier Transform (FFT) [bracewell1986fourier] analysis.
Figure 3: Overview of our approach: (a) Compress the original image $x$ into a latent space $z$. (b) Extract fine-texture to coarse-semantic information from images and pseudo-label it in a self-supervised manner (Sections 3.1.1 to 3.1.2). (c) Enhance features via contrastive learning (Section 3.1.3). (d) Construct a bitrate-perception Markov diffusion process for scalable feature encoding (Section 3.2). (e) Decode the features and construct an optimized estimate $\hat{x}$ of the original image $x$.
Figure 4: Clustering number and distortion curves on COCO 2017 [lin2014microsoft] and FFHQ [karras2019style], and clustering results visualization.
Figure 5: An example of using the hierarchical clustering method to construct a palette compression.
...and 3 more figures
