Extremely low-bitrate Image Compression Semantically Disentangled by LMMs from a Human Perception Perspective
Juan Song, Lijie Yang, Mingtao Feng
TL;DR
The paper proposes SEDIC, a semantically disentangled framework for extremely low-bitrate image compression. It uses large multimodal models (LMMs) to extract compact semantic representations, and a training-free Object Restoration model with Attention Guidance (ORAG), built on a pre-trained ControlNet, to restore content progressively. A multi-stage decoder reconstructs the image object by object from an extremely compressed reference image, guided by object-level descriptions and semantic masks to maintain semantic consistency and perceptual quality. On Kodak, DIV2K, and CLIC2020, SEDIC achieves better perceptual metrics than state-of-the-art approaches at or below $0.05$ bpp and performs robustly on both simple and complex scenes, as supported by ablations and a user study. The approach points to a viable path for combining LMMs with controllable diffusion decoders to obtain high-fidelity reconstructions at ultra-low bitrates.
Abstract
Compressing images at extremely low bitrates while achieving both semantic consistency and high perceptual quality remains a significant challenge. Inspired by the progressive mechanism of human perception, we propose a Semantically Disentangled Image Compression framework (SEDIC). First, an extremely compressed reference image is obtained through a learned image encoder. We then leverage large multimodal models (LMMs) to extract essential semantic components: an overall description, detailed object descriptions, and semantic segmentation masks. We propose a training-free Object Restoration model with Attention Guidance (ORAG), built on a pre-trained ControlNet, to restore object details conditioned on object-level text descriptions and semantic masks. Based on ORAG, we design a multistage semantic image decoder that progressively restores details object by object, starting from the extremely compressed reference image and ultimately generating high-quality, high-fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low bitrates ($\le$ 0.05 bpp).
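To give a sense of scale for the 0.05 bpp budget, the sketch below converts a bits-per-pixel rate into a total bit and byte budget for one image. The 768×512 resolution is that of a Kodak image; the helper names `bit_budget` and `bits_per_pixel` are illustrative and do not come from the paper.

```python
def bit_budget(width: int, height: int, bpp: float) -> tuple[float, float]:
    """Return (total bits, total bytes) available at a given bits-per-pixel rate."""
    total_bits = width * height * bpp
    return total_bits, total_bits / 8


def bits_per_pixel(num_bytes: int, width: int, height: int) -> float:
    """Return the bits-per-pixel rate of a bitstream of num_bytes bytes."""
    return num_bytes * 8 / (width * height)


# Budget for a 768x512 image at the 0.05 bpp upper bound quoted in the abstract.
bits, nbytes = bit_budget(768, 512, 0.05)
print(f"{bits:.1f} bits, {nbytes:.1f} bytes")

# Size of the same image stored as uncompressed 24-bit RGB, for comparison.
raw_bytes = 768 * 512 * 3
print(f"compression ratio vs. raw RGB: {raw_bytes / nbytes:.0f}x")
```

At this rate the entire bitstream, including the reference image, text descriptions, and masks, has to fit in roughly 2.4 KB, about 480 times smaller than the raw 24-bit RGB data.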
