HiDE: Hierarchical Dictionary-Based Entropy Modeling for Learned Image Compression

Haoxuan Xiong, Yuanyuan Xu, Kun Zhu, Yiming Wang, Baoliu Ye

TL;DR

HiDE, a Hierarchical Dictionary-based Entropy modeling framework for learned image compression, decomposes external priors into global structural and local detail dictionaries with cascaded retrieval, enabling structured and efficient utilization of external information.

Abstract

Learned image compression (LIC) has achieved remarkable coding efficiency, where entropy modeling plays a pivotal role in minimizing bitrate through informative priors. Existing methods predominantly exploit internal contexts within the input image, yet the rich external priors embedded in large-scale training data remain largely underutilized. Recent advances in dictionary-based entropy models have demonstrated that incorporating external priors can substantially enhance compression performance. However, current approaches organize heterogeneous external priors within a single-level dictionary, resulting in imbalanced utilization and limited representational capacity. Moreover, effective entropy modeling requires not only expressive priors but also a parameter estimation network capable of interpreting them. To address these challenges, we propose HiDE, a Hierarchical Dictionary-based Entropy modeling framework for learned image compression. HiDE decomposes external priors into global structural and local detail dictionaries with cascaded retrieval, enabling structured and efficient utilization of external information. Moreover, a context-aware parameter estimator with a parallel multi-receptive-field design is introduced to adaptively exploit heterogeneous contexts for accurate conditional probability estimation. Experimental results show that HiDE achieves 18.5%, 21.99%, and 24.01% BD-rate savings over VTM-12.1 on the Kodak, CLIC, and Tecnick datasets, respectively.
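
To make the link between "accurate conditional probability estimation" and bitrate concrete, the sketch below estimates the bit cost of one quantized latent symbol under a discretized Gaussian with predicted mean and scale. This is the formulation commonly used in learned image compression; that HiDE uses exactly this form is an assumption, and the names `gaussian_cdf` and `bits_for_symbol` are hypothetical helpers, not taken from the paper.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bits_for_symbol(y_hat, mu, sigma, floor=1e-9):
    """Estimated bits to code an integer symbol y_hat.

    Assumption: the probability mass of y_hat is the Gaussian mass on the
    unit-width bin [y_hat - 0.5, y_hat + 0.5]; the floor avoids log(0).
    """
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, floor))


# A prediction close to the true symbol costs fewer bits than a poor one.
good = bits_for_symbol(3, mu=3.1, sigma=0.5)
poor = bits_for_symbol(3, mu=0.0, sigma=2.0)
print(f"well-predicted: {good:.2f} bits, poorly predicted: {poor:.2f} bits")
```

Under this assumption, richer priors (internal or external) help only to the extent that they move the predicted mean toward the true symbol and tighten the predicted scale.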

Paper Structure

This paper contains 18 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: BD-Rate vs. decoding latency on the Kodak dataset (top-left indicates better performance).
  • Figure 2: Visualization of dictionary entry utilization. The top two rows show attention heatmaps from DCAE on Kodak images, highlighting the dominant entries in the retrieval process. The bottom row presents the mean attention scores of dictionary entries across the Kodak dataset: DCAE’s single-level dictionary (left) exhibits highly skewed usage, while our proposed Global (middle) and Detail (right) dictionaries achieve more balanced utilization.
  • Figure 3: Overview of the proposed HiDE framework. HiDE integrates hierarchical dictionaries ($\delta_{G}, \delta_{D}$) to retrieve external priors, which are fused with the hyperprior $\mathcal{F}_z$ and channel-wise autoregressive context. The aggregated context guides the entropy parameter estimator $f_E$ and the latent residual predictor $f_{LRP}$ to refine reconstruction quality and improve entropy modeling accuracy.
  • Figure 4: Overview of the proposed Hierarchical Dictionary-based Entropy Model, which consists of the Hierarchical Dictionary-based Context Model (HD) and Context-aware Parameter Estimation (CaPE). Top-left: slice-wise entropy model that aggregates the hyperprior $\mathcal{F}_z$ and the autoregressive context $\bar{y}_{<i}$. Top-right (blue box $e_i$): internal structure of the $i$-th slice network, which employs the Hierarchical Dictionary Cross-Attention (HDCA) module to retrieve external priors for parameter estimation ($f_E$) and residual prediction ($f_{LRP}$). Bottom (yellow box): HDCA performs a two-stage retrieval, first querying the Global Dictionary $\delta_G$ for global structural priors $\mathcal{C}_{G_i}$, and then querying the Detail Dictionary $\delta_D$ for local detailed texture priors $\mathcal{C}_{D_i}$ conditioned on the global context. Bottom-right (pink box): fusion stages generating the intermediate $X_{e_i}$ and the final dictionary-aware context $\mathcal{F}_{dict_i}$.
  • Figure 5: Illustration of the proposed Context-aware Parameter Estimation (CaPE) module. Left: CaPE is employed in two modules of the entropy model. First, as the parameter estimator $f_E$, it extracts a shared context representation to predict both $\mu$ and $\sigma$. Second, as $f_{LRP}$, it predicts the quantization residual $r$. Middle: the internal structure of the context extractor. Right: the task-specific heads for mean, scale, and residual prediction, implemented with lightweight stacked $3\times3$ convolutions and GELU activations.
  • ...and 2 more figures
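
The two-stage retrieval described in the Figure 4 caption can be sketched as two cross-attention lookups, where the second query is conditioned on the result of the first. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the dictionaries are random matrices standing in for learned entries, conditioning is done by simple addition, fusion is plain concatenation, and all function names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, dictionary):
    """Retrieve a weighted mix of dictionary entries for each query row.

    query: (N, d) context features; dictionary: (K, d) entries.
    Assumption: entries serve as both keys and values, with no projections.
    """
    scores = query @ dictionary.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)          # (N, K), rows sum to 1
    return weights @ dictionary, weights


def cascaded_retrieval(context, global_dict, detail_dict):
    """Stage 1 queries the global dictionary; stage 2 queries the detail
    dictionary with a query conditioned on the global result."""
    c_global, w_global = cross_attention(context, global_dict)
    detail_query = context + c_global           # assumed conditioning scheme
    c_detail, w_detail = cross_attention(detail_query, detail_dict)
    fused = np.concatenate([context, c_global, c_detail], axis=-1)
    return fused, w_global, w_detail


rng = np.random.default_rng(0)
context = rng.standard_normal((16, 32))         # 16 latent positions, 32 channels
global_dict = rng.standard_normal((8, 32))      # hypothetical: few structural entries
detail_dict = rng.standard_normal((64, 32))     # hypothetical: many texture entries
fused, w_global, w_detail = cascaded_retrieval(context, global_dict, detail_dict)
print(fused.shape, w_global.shape, w_detail.shape)
```

Averaging `w_global` or `w_detail` over positions gives per-entry mean attention scores, the kind of utilization statistic that Figure 2 visualizes.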