Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate

Kaile Wang, Lijun He, Haisheng Fu, Haixia Bi, Fan Li

TL;DR

Ultra-low bitrate generative image compression often suffers from semantic deviations caused by hallucinations. The authors propose MTGC, a multimodal-guided, task-aware compression framework in which a concise text caption, a highly compressed image (HCI), and Semantic Pseudo-Words (SPWs) guide a diffusion-based decoder, the Multimodal-Guided Diffusion Decoder (MGDD). The SPWs are produced by the Task-Aware Semantic Compression Module (TASCM). The method uses a three-stage training protocol and dual-path conditioning to balance high-level semantics and low-level visuals, and reports improved semantic consistency with competitive perceptual and pixel fidelity on standard benchmarks. Ablation studies show that the guidance streams are complementary and that the full MTGC setup performs best. The work targets reliable semantic compression for bandwidth-constrained semantic communication scenarios.

Abstract

Generative image compression has recently shown impressive perceptual quality, but often suffers from semantic deviations caused by generative hallucinations at ultra-low bitrate (bpp < 0.05), limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of multimodal guidance, and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise but robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. The SPWs are generated by our designed Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner to drive the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. Subsequently, to facilitate the synergistic guidance of these modalities, we design a Multimodal-Guided Diffusion Decoder (MGDD) employing a dual-path cooperative guidance mechanism that combines cross-attention and ControlNet additive residuals to precisely inject these three guidance signals into the diffusion process, and leverages the diffusion model's powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.
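
To make the ultra-low bitrate regime concrete, the sketch below converts the bpp < 0.05 threshold into a byte budget and computes the bitrate of a payload split across the three guidance streams. The 512×512 image size, the function names, and the per-stream byte counts are illustrative assumptions, not values from the paper.

```python
import math


def byte_budget(bpp, width, height):
    """Largest whole number of bytes whose bitrate does not exceed `bpp`."""
    return math.floor(bpp * width * height / 8)


def bits_per_pixel(stream_bytes, width, height):
    """Bitrate of a payload given as a mapping of stream name to size in bytes."""
    return 8 * sum(stream_bytes.values()) / (width * height)


# Hypothetical sizes for the three streams (caption, HCI, SPWs); not from the paper.
streams = {"caption": 60, "hci": 1200, "spw": 300}

budget = byte_budget(0.05, 512, 512)     # bytes available at the 0.05 bpp threshold
bpp = bits_per_pixel(streams, 512, 512)  # bitrate of the hypothetical payload

print(f"budget: {budget} bytes, payload: {bpp:.4f} bpp")
```

Under these assumptions the entire description of a 512×512 image has to fit in roughly 1.6 KB, which is why each guidance stream must be kept compact.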

Paper Structure

This paper contains 32 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Visual comparison of different methods at ultra-low bitrate. (b) ELIC [he2022elic] suffers from blurring and oversmoothing, (c) VQIR [wei2024toward] exhibits artifacts, and (d) PICS [lei2023text+] exhibits semantic deviations, while (e) our method achieves superior perceptual quality and semantic consistency.
  • Figure 2: Overall framework of the proposed MTGC. The encoder extracts three guidance modalities from the original image: a text caption, an HCI, and SPWs generated by the TASCM. The text caption and SPWs are losslessly compressed into bitstreams using Zstd (RFC 8878, DOI 10.17487/RFC8878), while the HCI is compressed via arithmetic coding. The decoder uses an MGDD, which integrates these signals with its powerful generative priors to reconstruct the image with high perceptual quality and semantic consistency.
  • Figure 3: Information Entropy Comparison: HCI vs. Sketch and Semantic map (denoted as “Semantic” in the figure). HCI exhibits significantly higher information entropy, demonstrating its superior information density.
  • Figure 4: Detailed architecture of the TASCM, comprising a SemEnc and a CMAN, and supporting end-to-end joint training with downstream generative models.
  • Figure 5: Detailed architecture of MGDD, leveraging a pre-trained diffusion model with dual-path conditioning that combines cross-attention and ControlNet additive residuals for multimodal cooperative guidance.
  • ...and 8 more figures
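
The Figure 3 caption compares guidance signals by information entropy. The exact entropy definition used in the paper is not given in this listing, so the following is only a minimal sketch assuming Shannon entropy over the histogram of discrete pixel values. The function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits per symbol, of a sequence of discrete values."""
    if not values:
        raise ValueError("entropy of an empty sequence is undefined")
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy inputs: a two-level map (like a binary sketch) versus an image
# whose 8-bit intensities are all equally frequent.
binary_map = [0, 255] * 128
uniform_8bit = list(range(256))

h_binary = shannon_entropy(binary_map)
h_uniform = shannon_entropy(uniform_8bit)
```

A two-level map carries at most 1 bit per pixel under this measure, while an 8-bit image can carry up to 8, which illustrates the kind of gap in information density that the caption describes.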