Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation

Han Li; Shaohui Li; Shuangrui Ding; Wenrui Dai; Maida Cao; Chenglin Li; Junni Zou; Hongkai Xiong

TL;DR

The paper addresses image compression for machine and human vision (ICMH), aiming to reduce the training and storage overhead of adapting pre-trained human-vision codecs to machine-vision tasks. It introduces Adapt-ICMH, a plug-and-play framework that inserts Spatial-Frequency Modulation Adapters (SFMA) into the encoder and decoder of a frozen base codec and trains only the adapters with the loss $\mathcal{L} = \mathcal{R} + \lambda \cdot \mathcal{D}(\mathbf{x}, \hat{\mathbf{x}}; \mathcal{G})$, which balances the bitrate $\mathcal{R}$ against the task-perceptual distortion $\mathcal{D}$. SFMA combines a Spatial Modulation Adapter and a Frequency Modulation Adapter to suppress non-semantic spatial information and emphasize task-relevant frequencies, so the latent can be adapted by training only a small fraction of the parameters. Experiments across multiple learned image compression (LIC) backbones and machine vision tasks show consistent rate-accuracy gains, reduced training overhead, and compatibility with diverse architectures; the paper also reports qualitative results and an extension towards scalable coding.
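As a rough illustration of the loss above, the following minimal sketch computes $\mathcal{R} + \lambda \cdot \mathcal{D}$ for toy inputs. It rests on assumptions not stated in this summary: the rate is taken as the estimated bits per pixel from the entropy model's likelihoods, and the distortion is taken as the mean squared error between features that the task network $\mathcal{G}$ produces for $\mathbf{x}$ and $\hat{\mathbf{x}}$. The function names are hypothetical and do not come from the authors' code.

```python
import math


def rate_bpp(likelihoods, num_pixels):
    """Estimated rate R in bits per pixel: sum of -log2 p over latent symbols."""
    return sum(-math.log2(p) for p in likelihoods) / num_pixels


def feature_mse(feat_x, feat_x_hat):
    """Assumed distortion D: MSE between task-network features of x and x_hat."""
    return sum((a - b) ** 2 for a, b in zip(feat_x, feat_x_hat)) / len(feat_x)


def rd_loss(likelihoods, num_pixels, feat_x, feat_x_hat, lam):
    """Rate-distortion objective L = R + lambda * D."""
    return rate_bpp(likelihoods, num_pixels) + lam * feature_mse(feat_x, feat_x_hat)


# Toy example: four latent symbols for a 4-pixel image, two feature values.
loss = rd_loss(
    likelihoods=[0.5, 0.25, 0.5, 0.25],  # 1 + 2 + 1 + 2 = 6 bits -> 1.5 bpp
    num_pixels=4,
    feat_x=[1.0, 2.0],
    feat_x_hat=[1.0, 4.0],               # MSE = (0 + 4) / 2 = 2.0
    lam=0.5,
)
print(loss)  # 1.5 + 0.5 * 2.0 = 2.5
```

A larger $\lambda$ weights task distortion more heavily and therefore tends to move the trained codec towards higher bitrates.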

Abstract

Image compression for machine and human vision (ICMH) has gained increasing attention in recent years. Existing ICMH methods are limited by high training and storage overheads due to the heavy design of task-specific networks. To address this issue, we develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH, that better balances task performance and bitrates with reduced overheads. We propose a spatial-frequency modulation adapter (SFMA) that simultaneously eliminates non-semantic redundancy with a spatial modulation adapter, and enhances task-relevant frequency components while suppressing task-irrelevant ones with a frequency modulation adapter. The proposed adapter is plug-and-play and compatible with almost all existing learned image compression models without compromising the performance of pre-trained models. Experiments demonstrate that Adapt-ICMH consistently outperforms existing ICMH frameworks on various machine vision tasks with fewer fine-tuned parameters and reduced computational complexity. Code will be released at https://github.com/qingshi9974/ECCV2024-AdpatICMH .

Paper Structure (34 sections, 9 equations, 14 figures, 10 tables)

Introduction
Related Work
Methods
Empirical Findings by Full Fine-tuning
Framework Overview
Spatial-Frequency Modulation Adapter
Experiments
Training Details and Datasets.
Evaluation
Rate-Accuracy Comparison
Ablation study
Qualitative Results
Towards scalable coding for machine and human vision.
Conclusion
Acknowledgement
...and 19 more sections

Figures (14)

Figure 1: Left: our adapter-based tuning framework. Right: Rate-accuracy performance comparison on classification for ImageNet-val [deng2009imagenet]. We compare our methods (Ours-$n$ indicates $n$ middle dimensions for SFMA) with full fine-tuning, TransTIC [chen2023transtic], ICMH-Net [liu2023icmh], and channel selection [liu2022improving]. BD-accuracy is computed by replacing the PSNR in BD-PSNR [bdrate] with top-1 accuracy and setting the base codec of TIC [lu2022transformer] as the anchor. The size of circles indicates GFLOPs for inference during encoding.
Figure 2: Visualization of the bit allocation maps (first row) and power spectral density maps (second row) of the latent $\hat{y}$. The left part shows the raw input image. Each column of the right part corresponds to a codec for each task, including the base codec for human vision and three fine-tuned codecs for machine vision tasks. The bit allocation map is computed by averaging the negative log-likelihood (i.e., $-\log_2 p(\hat{y})$) across channels. The power spectral density map is computed by applying the Fast Fourier Transform (FFT) to $\hat{y}$ with a shift operation to center the zero frequency component, and then averaging its absolute value across channels.
Figure 3: Overview of our proposed Adapt-ICMH framework. Multiple spatial-frequency modulation adapters (SFMA) are plugged into the encoder $g_a$ and decoder $g_s$ of the base codec for feature adaptation. During the adaptation to the machine vision task, the base codec is frozen and only these adapters are trainable. For brevity, we do not illustrate the specific architecture of the encoder stage, decoder stage, and entropy model, as it depends on the specific base codec. Please see Appendix E for the detailed architecture.
Figure 4: Rate-Accuracy performance comparison under different machine vision tasks and different base codecs.
Figure 5: Different variants of SFMA: (a) Proposed SFMA. (b) SMA-only. (c) FMA-only. (d) FMA-SMA-sequential. (e) SMA-FMA-sequential. (f) SMA-SMA-parallel.
...and 9 more figures
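
The two maps described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' code: the latent and its likelihoods are assumed to be arrays of shape (C, H, W), and the function names are hypothetical. Following the caption, the second map averages the FFT magnitude (absolute value) across channels rather than a squared magnitude.

```python
import numpy as np


def bit_allocation_map(likelihoods):
    """Average -log2 p(y_hat) across channels.

    likelihoods: array of shape (C, H, W) with values in (0, 1].
    Returns an (H, W) map of estimated bits per latent position.
    """
    return (-np.log2(likelihoods)).mean(axis=0)


def power_spectral_density_map(y_hat):
    """2D FFT per channel, zero frequency shifted to the center,
    then the absolute value averaged across channels.

    y_hat: array of shape (C, H, W). Returns an (H, W) map.
    """
    spectrum = np.fft.fft2(y_hat, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))
    return np.abs(spectrum).mean(axis=0)


# Toy latent: 2 channels of size 4 x 4 (assumed shapes for illustration).
toy_likelihoods = np.full((2, 4, 4), 0.5)  # every symbol costs exactly 1 bit
toy_latent = np.ones((2, 4, 4))            # constant signal: only the DC term is non-zero

bits = bit_allocation_map(toy_likelihoods)
psd = power_spectral_density_map(toy_latent)
```

For the constant toy latent, all spectral energy sits at the zero-frequency bin, which the shift moves to the center of the map at index (2, 2).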
