S2CFormer: Revisiting the RD-Latency Trade-off in Transformer-based Learned Image Compression

Yunuo Chen, Qian Li, Bing He, Donghui Feng, Ronghua Wu, Qi Wang, Li Song, Guo Lu, Wenjun Zhang

TL;DR

The paper tackles the trade-off between rate-distortion (R-D) performance and decoding latency in transformer-based learned image compression (LIC) by shifting focus from complex spatial interactions to efficient channel aggregation. It introduces the S2CFormer paradigm, which pairs a simplified spatial path (separable convolution or window attention) with FFN-based channel aggregation, and shows that channel aggregation is the primary driver of R-D performance. The S2C-Identity, S2C-Conv, and S2C-Attention variants reach state-of-the-art R-D performance with significantly faster decoding, and the S2C-Hybrid variant further improves the performance-latency trade-off by combining different instantiations stage by stage. The results set new benchmarks on the Kodak, Tecnick, and CLIC datasets and highlight the potential of advanced FFN structures for LIC, offering a practical path toward efficient, high-performance LIC systems.

Abstract

Transformer-based Learned Image Compression (LIC) suffers from a suboptimal trade-off between decoding latency and rate-distortion (R-D) performance. Moreover, the critical role of the FeedForward Network (FFN)-based channel aggregation module has been largely overlooked. Our research reveals that efficient channel aggregation, rather than complex and time-consuming spatial operations, is the key to achieving competitive LIC models. Based on this insight, we initiate the "S2CFormer" paradigm, a general architecture that simplifies spatial operations and enhances channel operations to overcome the previous trade-off. We present two instances of the S2CFormer: S2C-Conv and S2C-Attention. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. Furthermore, we introduce S2C-Hybrid, an enhanced variant that maximizes the strengths of different S2CFormer instances to achieve a better performance-latency trade-off. This model outperforms all the existing methods on the Kodak, Tecnick, and CLIC Professional Validation datasets, setting a new benchmark for efficient and high-performance LIC. The code is at https://github.com/YunuoChen/S2CFormer.
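
To make the two-module structure concrete, here is a minimal NumPy sketch of a block that applies a pluggable spatial-interaction step followed by an FFN-based channel-aggregation step. It is an illustration under stated assumptions, not the authors' implementation: the pre-norm residual layout, the GELU activation, the depthwise 3x3 convolution standing in for the separable-convolution path, and every name in the code (`s2c_block`, `make_ffn`, `make_depthwise_conv`, `identity_spatial`) are hypothetical choices made for this example. Window attention is omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each pixel's channel vector (last axis). Learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def identity_spatial(x):
    # Simplest spatial path: no mixing between pixels at all.
    return x

def make_depthwise_conv(kernel):
    # kernel has shape (k, k, C): one k x k filter per channel, no cross-channel mixing.
    k = kernel.shape[0]
    pad = k // 2
    def spatial(x):
        h, w, _ = x.shape
        xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
        out = np.zeros_like(x)
        for i in range(k):
            for j in range(k):
                out += xp[i:i + h, j:j + w, :] * kernel[i, j, :]
        return out
    return spatial

def make_ffn(w1, w2):
    # Channel aggregation: the same two-layer MLP applied to every pixel independently.
    # w1 has shape (C, hidden), w2 has shape (hidden, C).
    def ffn(x):
        return gelu(x @ w1) @ w2
    return ffn

def s2c_block(x, spatial, channel):
    # x has shape (H, W, C). Pre-norm residual layout is an assumption.
    x = x + spatial(layer_norm(x))   # spatial interaction
    x = x + channel(layer_norm(x))   # channel aggregation
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, C, hidden = 8, 8, 16, 64   # illustrative sizes, not taken from the paper
    x = rng.standard_normal((H, W, C))
    ffn = make_ffn(0.1 * rng.standard_normal((C, hidden)),
                   0.1 * rng.standard_normal((hidden, C)))
    conv = make_depthwise_conv(0.1 * rng.standard_normal((3, 3, C)))
    # Hypothetical stage-wise mix of spatial paths, in the spirit of a hybrid model.
    y = x
    for spatial in (identity_spatial, conv):
        y = s2c_block(y, spatial, ffn)
    print(y.shape)  # (8, 8, 16)
```

Because the spatial step is passed in as a function, swapping the spatial path per stage only changes which callable is handed to `s2c_block`; the channel-aggregation step stays the same in every variant.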

Paper Structure

This paper contains 20 sections, 9 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: S2CFormer and the performance of S2CFormer-based models. The general structure of our S2CFormer is shown in (a). It consists of two key components: the Spatial Interaction module and the Channel Aggregation module. S2CFormer functions as nonlinear transform blocks for Learned Image Compression (LIC). Our analysis reveals that the competence of transformer-based LIC models primarily stems from channel aggregation. Building on this insight, we propose a novel design strategy that rebalances these two modules to achieve a more favorable trade-off between compression performance and decoding latency. As illustrated in (b), the data points for S2CFormer-based models exhibit a linear trend with a steeper slope, thereby underscoring their superior performance–latency characteristics.
  • Figure 2: Comparison of execution times for spatial interaction and channel aggregation across different models. Previous methods show much higher spatial interaction times than channel aggregation, causing significant delays. Our S2CFormer effectively rebalances the time relationship between these two modules.
  • Figure 3: Overview of S2CFormer-based LIC model. We adopt the basic VAE structure from [minnen2018joint, balle2018variational] and integrate the SCCTX entropy model from [he2022elic]. The hierarchical architecture consists of five stages of nonlinear transform blocks. Each stage contains $L_i$ S2CFormer blocks. The general S2CFormer architecture is shown in (a), and (b-d) illustrate three S2CFormer instances. $L_1$-$L_6$ and $C_1$-$C_6$ represent block numbers and channel numbers for each stage, respectively.
  • Figure 4: Vanilla FFN (a) and Advanced FFNs (b-c)
  • Figure 6: The effective receptive fields (ERF) [luo2016understanding] calculated by different models. "CA" refers to the Channel Aggregation module.
  • ...and 10 more figures