Extremely low-bitrate Image Compression Semantically Disentangled by LMMs from a Human Perception Perspective

Juan Song, Lijie Yang, Mingtao Feng

TL;DR

The paper proposes SEDIC, a semantically disentangled framework for extremely low-bitrate image compression that uses large multimodal models to extract compact semantic representations and a training-free ORAG built on ControlNet to progressively restore content. A multi-stage decoder reconstructs images object-by-object from an extremely compressed reference image, guided by object-level descriptions and semantic masks to maintain semantic consistency and perceptual quality. Across Kodak, DIV2K, and CLIC2020, SEDIC yields superior perceptual metrics at or below $0.05$ bpp and demonstrates robust performance on simple and complex scenes, supported by ablations and a user study. This approach highlights a viable path for integrating LMMs with controllable diffusion decoders to achieve high-fidelity reconstructions at ultra-low bitrates.
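To give a sense of how small a $0.05$ bpp budget is, the short calculation below converts a bits-per-pixel rate into a total bit and byte budget. It is an illustrative sketch, not part of the paper: the helper name `bit_budget` is hypothetical, and the 768×512 resolution is assumed as the usual size of a Kodak test image.

```python
def bit_budget(width: int, height: int, bpp: float) -> tuple[float, float]:
    """Return (bits, bytes) available for an image at a given bits-per-pixel rate."""
    bits = width * height * bpp  # bpp is total bits divided by pixel count
    return bits, bits / 8


# Assumed example: one 768x512 image (typical Kodak resolution) at 0.05 bpp.
bits, n_bytes = bit_budget(768, 512, 0.05)
print(f"{bits:.0f} bits, about {n_bytes:.0f} bytes")
```

Under these assumptions the whole image, including any text descriptions and masks, has to fit in roughly 2.4 KB.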

Abstract

It remains a significant challenge to compress images at extremely low bitrates while achieving both semantic consistency and high perceptual quality. Inspired by the progressive mechanism of human perception, we propose a Semantically Disentangled Image Compression framework (SEDIC). First, an extremely compressed reference image is obtained through a learned image encoder. We then leverage LMMs to extract essential semantic components, including an overall description, detailed object-level descriptions, and semantic segmentation masks. We propose a training-free Object Restoration model with Attention Guidance (ORAG), built on a pre-trained ControlNet, to restore object details conditioned on object-level text descriptions and semantic masks. Based on ORAG, we design a multi-stage semantic image decoder that progressively restores details object by object, starting from the extremely compressed reference image and ultimately generating high-quality, high-fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low bitrates ($\le$ 0.05 bpp).
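The control flow of the multi-stage decoder described above can be sketched as a simple loop. This is a rough illustration only, not the authors' implementation: `progressive_decode`, `restore_object`, and `enhance_image` are hypothetical names, the two callables stand in for ORAG and the final text-conditioned diffusion stage, and compositing restored pixels only inside each mask is an assumption made for the sketch.

```python
import numpy as np


def progressive_decode(reference, objects, overall_description,
                       restore_object, enhance_image):
    """Restore an image object by object, then refine it as a whole.

    reference: H x W x C array, the extremely compressed reference image.
    objects: list of (description, mask) pairs; mask is an H x W bool array.
    restore_object / enhance_image: hypothetical stand-ins for ORAG and the
    final text-conditioned stage (both are assumptions, not the paper's API).
    """
    image = reference.copy()
    for description, mask in objects:
        restored = restore_object(image, mask, description)
        # Assumption: keep restored pixels only inside this object's mask.
        image = np.where(mask[..., None], restored, image)
    return enhance_image(image, overall_description)


# Toy run with dummy stages: "restoration" just adds 1 to every pixel.
reference = np.zeros((4, 4, 3))
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:2, :2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:, 2:] = True

out = progressive_decode(
    reference,
    [("a red car", mask_a), ("a tree", mask_b)],
    "a car parked next to a tree",
    restore_object=lambda img, mask, text: img + 1.0,
    enhance_image=lambda img, text: img,
)
```

In the toy run, only the pixels covered by each mask change, which mirrors the object-by-object ordering while leaving the actual restoration models unspecified.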

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Starting from the extremely compressed reference image, our proposed ORAG first restores details progressively, object by object, conditioned on object descriptions and semantic masks. Finally, the overall description is used to enhance the overall perceptual quality.
  • Figure 2: Overall framework of SEDIC. (a) Semantically Disentangled image encoder consists of an image textualization encoder to extract overall and object-level detailed descriptions, a semantic mask encoder, and an image encoder to obtain an extremely compressed reference image. (b) Multi-stage Semantic Image Decoder consists of several Object Restoration models with Attention Guidance (ORAG) to restore object details and a conditional text-to-image diffusion model to restore the entire image. (c) The ORAG model restores the object details given object text descriptions and semantic masks.
  • Figure 3: Question template designed to guide GPT-4 Vision in image-to-text encoding. The template comprises three stages: (1) object listing, (2) fine-grained object-level textualization, and (3) holistic image-level captioning.
  • Figure 4: Quantitative comparisons with SOTA methods in terms of perceptual quality (LPIPS$\downarrow$ / DISTS$\downarrow$ / FID$\downarrow$ / KID$\downarrow$) on the Kodak, DIV2K validation, and CLIC2020 datasets.
  • Figure 5: We visually compare our SEDIC framework with stable diffusion-based methods on Kodak and DIV2K validation datasets under extremely low-bitrate settings. The corresponding bpp and LPIPS values are displayed below the images.
  • ...and 7 more figures