FEDS: Feature and Entropy-Based Distillation Strategy for Efficient Learned Image Compression

Haisheng Fu, Jie Liang, Zhenman Fang, Jingning Han

TL;DR

This work tackles the practical burden of learned image compression by introducing FEDS, a feature and entropy-based distillation framework that transfers knowledge from a high-capacity Swin-V2–augmented teacher to a compact student. It combines feature alignment with an entropy-driven selection of latent channels, implemented within a three-phase training regime to preserve performance while dramatically reducing parameters and speeding encoding/decoding. Empirical results across Kodak, Tecnick, and CLIC demonstrate that the student nearly matches the teacher with only a small BD-Rate gap, while achieving roughly a 2.9× speedup and a ~63% reduction in parameters, making LIC more viable for real-time and resource-constrained settings. The approach is shown to generalize to transformer-based architectures and is supported by an information-theoretic justification linking mutual information, entropy, and KL-based feature transfer.

Abstract

Learned image compression (LIC) methods have recently outperformed traditional codecs such as VVC in rate-distortion performance. However, their large models and high computational costs have limited their practical adoption. In this paper, we first construct a high-capacity teacher model by integrating Swin-Transformer V2-based attention modules, additional residual blocks, and expanded latent channels, thus achieving enhanced compression performance. Building on this foundation, we propose a Feature and Entropy-based Distillation Strategy (FEDS) that transfers key knowledge from the teacher to a lightweight student model. Specifically, we align intermediate feature representations and emphasize the most informative latent channels through an entropy-based loss. A staged training scheme refines this transfer in three phases: feature alignment, channel-level distillation, and final fine-tuning. Our student model nearly matches the teacher across Kodak (1.24% BD-Rate increase), Tecnick (1.17%), and CLIC (0.55%) while cutting parameters by about 63% and accelerating encoding/decoding by around 73%. Moreover, ablation studies indicate that FEDS generalizes effectively to transformer-based networks. The experimental results demonstrate our approach strikes a compelling balance among compression performance, speed, and model parameters, making it well-suited for real-time or resource-limited scenarios.
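
The two distillation signals named in the abstract, feature alignment and an entropy-based emphasis on informative latent channels, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the histogram-based entropy estimate, the hard top-k channel selection, and the assumption that student and teacher tensors already share a shape (for example after a projection layer) are all hypothetical choices made for illustration. The three-phase training schedule is not modeled.

```python
import numpy as np


def feature_alignment_loss(student_feat, teacher_feat):
    """Mean squared error between two feature maps of identical shape (assumed)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))


def channel_entropy(latent, bins=32):
    """Histogram estimate of per-channel entropy in bits; latent has shape (C, H, W)."""
    entropies = []
    for channel in latent:
        counts, _ = np.histogram(channel, bins=bins)
        probs = counts[counts > 0] / counts.sum()
        entropies.append(float(-np.sum(probs * np.log2(probs))))
    return np.array(entropies)


def entropy_selected_channel_loss(student_latent, teacher_latent, top_k):
    """MSE restricted to the top_k teacher channels with the highest estimated entropy.

    Hard top-k selection is an illustrative assumption, not the paper's stated rule.
    """
    entropy = channel_entropy(teacher_latent)
    selected = np.argsort(entropy)[-top_k:]
    diff = student_latent[selected] - teacher_latent[selected]
    return float(np.mean(diff ** 2)), selected


# Toy data: four latent channels, one of which is constant and so carries no information.
rng = np.random.default_rng(0)
teacher_latent = rng.normal(size=(4, 8, 8))
teacher_latent[2] = 0.0
student_latent = teacher_latent + 0.1

loss, selected = entropy_selected_channel_loss(student_latent, teacher_latent, top_k=3)
```

In this toy setup the constant channel has zero estimated entropy, so it is excluded from the selected set and does not contribute to the channel-level loss.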

Paper Structure

This paper contains 32 sections, 16 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The decoding time and BD-Rate saving over H.266/VVC of different methods for the Kodak dataset. Our method achieves the best trade-off among the three metrics (decoding time, BD-Rate saving, and model size). Points closer to the upper-left corner indicate better results. Note that the unit of the left subfigure is milliseconds, while the unit of the right subfigure is seconds. Additionally, the area of each circle represents the model size in terms of the number of parameters.
  • Figure 2: The overall architecture of the proposed teacher network. $Q$ denotes the quantization module. The symbols $\uparrow$ and $\downarrow$ indicate up-sampling and down-sampling operations, while $3 \times 3$ refers to the convolutional kernel size. $AE$ and $AD$ represent the arithmetic encoder and decoder. The dotted lines illustrate shortcut connections with modified tensor dimensions. $ChARM$ represents the channel-wise auto-regressive entropy model.
  • Figure 3: The knowledge distillation framework between the teacher and student networks.
  • Figure 4: The R-D curves of different methods in terms of PSNR and MS-SSIM on the Kodak dataset.
  • Figure 5: The R-D curves of different methods in terms of PSNR on the Tecnick-100 and CLIC-2021-test datasets.
  • ...and 6 more figures