FD-LSCIC: Frequency Decomposition-based Learned Screen Content Image Compression

Shiqi Jiang, Hui Yuan, Shuai Li, Huanqiang Zeng, Sam Kwong

TL;DR

This work targets screen content (SC) image compression by addressing three SC-specific challenges: learning compact latent features, controlling quantization granularity per frequency component, and the scarcity of large-scale SC data. It introduces FD-LSCIC, a frequency-decomposition learned image compression (LIC) framework built on four components: MToRB for multi-frequency feature extraction, CTSFRB for multi-scale fusion, MFCIM for cross-frequency context interaction, and AQ for adaptive per-frequency quantization. It also introduces the SDU-SCICD10K dataset (over 10,000 images from PC and mobile platforms). The method uses a VAE-based rate-distortion (RD) objective with per-frequency entropy models and reports substantial BD-rate reductions relative to H.266/VVC and state-of-the-art LIC methods on SC datasets, alongside favorable complexity and qualitative results. Ablation studies indicate that each module contributes, and that multi-frequency processing and adaptive quantization improve SC compression performance and efficiency, which is relevant to SC-intensive applications.

Abstract

Learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which have distinct characteristics such as sharp edges, repetitive patterns, and embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compression: learning compact latent features, adapting quantization step sizes, and the lack of large SC datasets. To overcome these challenges, we propose a novel compression method that employs a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration, and a multi-frequency context interaction module (MFCIM) to reduce inter-frequency correlations. Additionally, we introduce an adaptive quantization module that learns scaled uniform noise for each frequency component, enabling flexible control over quantization granularity. Furthermore, we construct a large SC image compression dataset (SDU-SCICD10K), which includes over 10,000 images spanning basic SC images, computer-rendered images, and mixed NS and SC images from both PC and mobile platforms. Experimental results demonstrate that our approach significantly improves SC image compression performance, outperforming traditional standards and state-of-the-art learning-based methods in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM).
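The adaptive quantization idea in the abstract can be illustrated with a minimal sketch. It assumes, as is common in learned compression, that training replaces rounding with additive uniform noise and that testing rounds to a grid; the step size `delta` stands in for the learned noise interval of one frequency component. The function names, the step sizes, and the latent values are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
import random


def quantize_train(latent, delta, rng):
    """Training-time proxy (assumed): add uniform noise drawn from [-delta/2, delta/2]."""
    return [v + rng.uniform(-delta / 2, delta / 2) for v in latent]


def quantize_test(latent, delta):
    """Test-time hard quantization (assumed): snap each value to a multiple of delta."""
    return [round(v / delta) * delta for v in latent]


# Hypothetical step sizes: a finer step for one component, a coarser one for another.
deltas = {"high": 0.5, "low": 2.0}
latent = [0.93, -1.41, 2.27]

rng = random.Random(0)
for name, delta in deltas.items():
    noisy = quantize_train(latent, delta, rng)
    hard = quantize_test(latent, delta)
    print(name, [round(v, 3) for v in noisy], hard)
```

A smaller `delta` keeps the quantized values closer to the originals at the cost of more distinct symbols to code, which is the granularity trade-off that a per-frequency step size would control.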

Paper Structure

This paper contains 19 sections, 26 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Frequency characteristics of NS and SC images. The left column shows the original images, the middle column shows their corresponding high-frequency and low-frequency decompositions, and the right column shows the frequency spectrum distribution.
  • Figure 2: Illustration of OctConv-based frequency decomposition. (a) The original image features can be decomposed into a combination of multiple frequency components. (b) OctConv simplifies this by dividing the features into two parts: high-frequency in high-resolution tensors and low-frequency in low-resolution tensors. (c) The proposed multi-frequency octave convolution adds a middle-frequency component, where smoothly varying features are stored in lower-resolution tensors, sharply varying features in higher-resolution tensors, and the remaining features in intermediate-resolution tensors. (d) Different frequency components undergo intra-frequency information updates (purple arrows), information transfer from higher to lower frequencies (red arrows), and information transfer from lower to higher frequencies (gray arrows).
  • Figure 3: Down-sampling octave convolution and its variants: (a) OctConv [chen2019drop], (b) GoConv [akbari2021learned], (c) ToRB [chen2022two], and (d) MToRB (proposed). ${\alpha}$ denotes the channel ratio allocated to low-frequency features, ${\beta}$ denotes the channel ratio allocated to mid-frequency features, and the channel ratio for high-frequency features is $1 - {\alpha} - {\beta}$.
  • Figure 4: Framework of LIC methods. $g_a$ and $g_s$ denote the main encoder and decoder, respectively. $h_a$ and $h_s$ denote the hyper encoder and decoder, respectively. $C_m$ denotes the context model, $P_e$ denotes the entropy parameter model, FEM denotes the factorized entropy model [balle2018variational], and $\mathrm{Q}$ denotes the quantization module. The latent feature $\bm{y}$ (resp. hyper latent feature $\bm{z}$) is quantized as $\bm{\tilde{y}}$ (resp. $\bm{\tilde{z}}$) for training and $\bm{\hat{y}}$ (resp. $\bm{\hat{z}}$) for testing. AE and AD denote arithmetic encoding and arithmetic decoding, respectively.
  • Figure 5: Framework of the proposed method. The codec consists of a main encoder-decoder and a hyper encoder-decoder. $\mathrm{MToRB}$ denotes the multi-frequency two-stage octave residual block, $\mathrm{CTSFRB}$ denotes the cascaded triple-scale feature fusion residual block, $C_m$ denotes the context module, WAM denotes the window-based attention block [Zou_2022_CVPR], and $C_{h-l}$, $C_{h-m}$ and $C_{m-l}$ denote the multi-frequency context interaction modules (MFCIM). $P_h$, $P_m$ and $P_l$ denote the entropy parameter networks [minnen2018joint] for $\bm{y}^H$, $\bm{y}^M$ and $\bm{y}^L$, respectively. $\downarrow$ and $\uparrow$ denote downsampling and upsampling, respectively. The hyperpriors $\bm{\mathit{\Psi}} ^H$, $\bm{\mathit{\Psi}} ^M$ and $\bm{\mathit{\Psi}} ^L$ are input into $h_{sh}$, $h_{sm}$, and $h_{sl}$ to obtain the noise interval parameters $\bm{\Delta}_H$, $\bm{\Delta}_M$ and $\bm{\Delta}_L$, which are then applied to the adaptive quantization modules $\mathrm{AQ}_h$, $\mathrm{AQ}_m$ and $\mathrm{AQ}_l$, respectively.
  • ...and 11 more figures
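
The channel allocation described in the Figure 3 caption can be made concrete with a short sketch. It splits a channel budget into high-, mid-, and low-frequency branches using the ratios $\alpha$ and $\beta$ from the caption. The spatial layout (mid-frequency at half resolution, low-frequency at quarter resolution) is an assumption that extends the usual OctConv halving, and the helper names and the values 192, 64, and 0.25 are hypothetical rather than the paper's settings.

```python
def split_channels(channels, alpha, beta):
    """Return (high, mid, low) channel counts; alpha is the low ratio, beta the mid ratio."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    low = int(alpha * channels)
    mid = int(beta * channels)
    high = channels - low - mid  # remainder, about (1 - alpha - beta) * channels
    return high, mid, low


def branch_shapes(channels, height, width, alpha, beta):
    """Assumed layout: high at full, mid at 1/2, and low at 1/4 spatial resolution."""
    high, mid, low = split_channels(channels, alpha, beta)
    return {
        "high": (high, height, width),
        "mid": (mid, height // 2, width // 2),
        "low": (low, height // 4, width // 4),
    }


# Hypothetical configuration for illustration only.
shapes = branch_shapes(channels=192, height=64, width=64, alpha=0.25, beta=0.25)
for name, shape in shapes.items():
    print(name, shape)
```

Giving the high-frequency branch the remainder ensures the three counts always sum to the original channel budget, even when the ratios do not divide it evenly.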