Learned Image Transmission with Hierarchical Variational Autoencoder
Guangyi Zhang, Hanlei Li, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang
TL;DR
This paper addresses robust, efficient image transmission over wireless channels by introducing a hierarchical joint source-channel coding (HJSCC) framework built on a hierarchical variational autoencoder. The transmitter combines bottom-up and top-down paths to autoregressively generate multiple latent representations, which are encoded by several JSCC encoders. Masking and entropy-informed priors make the transmission rate adaptive. A novel training objective couples rate terms derived from the latent priors with distortion terms. The approach is also extended to a JSCC-with-feedback setting, in which transmission over a noisy channel is modeled as a probabilistic sampling process. Experiments on Kodak and CLIC2022 show that HJSCC outperforms existing baselines in rate-distortion performance and remains robust to channel noise, and ablations confirm the contribution of spatial grouping and the rate attention module. The work offers a practical path toward scalable, adaptive, and feedback-enabled image transmission in future networks.
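As a rough illustration of how rate and distortion terms can be coupled in a hierarchical VAE, a generic objective takes the form below. This is a sketch under assumptions and not necessarily the exact loss used in the paper: the symbols are introduced here for illustration, with z_l denoting the L latent representations, p(z_l | z_{<l}) the learned priors, d a distortion measure such as mean squared error, and λ a trade-off weight.

```latex
\mathcal{L} = \mathbb{E}\left[ \sum_{l=1}^{L} -\log p(z_l \mid z_{<l}) \right]
            + \lambda \, \mathbb{E}\left[ d(x, \hat{x}) \right]
```

The first term acts as a rate proxy (latents that are less probable under their priors cost more to transmit), and the second penalizes reconstruction error.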
Abstract
In this paper, we introduce a hierarchical joint source-channel coding (HJSCC) framework for image transmission, built on a hierarchical variational autoencoder (VAE). Our approach combines bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, the proposed HJSCC offers greater adaptability: it dynamically adjusts the transmission bandwidth by encoding the representations into varying numbers of channel symbols. Extensive experiments on images of varying resolutions demonstrate that the proposed model outperforms existing baselines in rate-distortion performance and remains robust to channel noise. The source code will be made available upon acceptance.
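The two mechanisms named above, transmission as probabilistic sampling and bandwidth adaptation by varying the number of channel symbols, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the real-valued AWGN channel, the prefix-style mask, and the `bits_per_symbol` parameter are all hypothetical choices made for the example.

```python
import numpy as np


def power_normalize(symbols):
    """Scale symbols so that their average power equals 1."""
    power = np.mean(symbols ** 2)
    return symbols / np.sqrt(power)


def awgn_sample(symbols, snr_db, rng):
    """Treat the channel as sampling y ~ N(x, sigma^2 I).

    Assumes unit-power, real-valued inputs (an illustrative simplification),
    so the noise variance is 10^(-SNR/10).
    """
    noise_var = 10.0 ** (-snr_db / 10.0)
    noise = rng.normal(0.0, np.sqrt(noise_var), size=symbols.shape)
    return symbols + noise


def mask_by_rate(symbols, entropy_bits, bits_per_symbol=2.0):
    """Keep a prefix of the symbols whose length grows with the entropy estimate.

    `entropy_bits` stands in for a rate estimate from a learned prior;
    `bits_per_symbol` is a hypothetical conversion factor.
    """
    keep = int(np.clip(np.ceil(entropy_bits / bits_per_symbol), 1, symbols.size))
    mask = np.zeros(symbols.size, dtype=bool)
    mask[:keep] = True
    return symbols[mask], mask


rng = np.random.default_rng(0)
latent_symbols = rng.normal(size=64)          # stand-in for one encoded latent
kept, mask = mask_by_rate(latent_symbols, entropy_bits=40.0)
tx = power_normalize(kept)                    # enforce the power constraint
rx = awgn_sample(tx, snr_db=10.0, rng=rng)    # one draw from the channel
```

In this sketch a latent with a higher estimated entropy is assigned more channel symbols, and the receiver sees a single noisy sample of the transmitted vector. A hierarchical scheme would apply the same steps to each latent representation in turn.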
