Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

TL;DR

SDComp introduces a semantically disentangled compression framework that uses visual grounding to locate objects, Large Multimodal Models to rank object importance, and semantically structured compression to transmit content by importance. By encoding only the most task-relevant regions in a progressive bitstream, SDComp improves downstream performance and reduces bitrate across segmentation, detection, classification, VQA, and captioning tasks, outperforming VVC, ELIC, and prior ICM methods with BD-rate gains. The approach offers flexible, interpretable, and task-adaptable bitstreams, enabling partial decoding while preserving semantic integrity. This work demonstrates the practical potential of integrating LMM semantics into image coding to support diverse machine-oriented vision tasks efficiently.

Abstract

We present a new image compression paradigm that achieves "intelligent coding for machines" by leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by evidence that large language/multimodal models are powerful general-purpose semantic predictors for understanding the real world. Unlike traditional image compression, which is typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to comply better with different downstream intelligent analysis tasks. To this end, we employ an LMM to tell the codec what to compress: 1) we first use the semantic understanding capability of LMMs with respect to object grounding, identification, and importance ranking via prompts to disentangle image content before compression; 2) then, based on these semantic priors, we encode and transmit the objects of the image in order with a structured bitstream. In this way, diverse vision benchmarks, including image classification, object detection, and instance segmentation, among others, can be well supported with such a semantically structured bitstream. We dub our method "SDComp" for "Semantically Disentangled Compression" and compare it with state-of-the-art codecs on a wide variety of vision tasks. The SDComp codec leads to more flexible reconstruction results, promising decoded visual quality, and a more generic and satisfactory ability to support intelligent tasks.
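
The idea of transmitting objects in importance order within a structured bitstream can be illustrated with a small sketch. This is not the SDComp implementation: the `Region` container, the chunk header layout, and the use of `zlib` as a stand-in for a learned codec are all assumptions made for illustration, and the importance scores are simply given rather than produced by an LMM. It shows only why an importance-ordered, chunked stream permits partial decoding.

```python
import struct
import zlib
from dataclasses import dataclass

HEADER = ">HHI"  # object id, label length, payload length (hypothetical layout)


@dataclass
class Region:
    """One grounded object (hypothetical container, not the paper's structure)."""
    obj_id: int
    label: str
    importance: float  # higher = more important; assumed to come from an LMM ranking
    pixels: bytes      # placeholder for the region's image data


def encode_structured(regions):
    """Pack regions most-important-first; each chunk is independently decodable."""
    ordered = sorted(regions, key=lambda r: r.importance, reverse=True)
    stream = bytearray()
    for r in ordered:
        payload = zlib.compress(r.pixels)  # zlib stands in for a learned codec
        label = r.label.encode("utf-8")
        stream += struct.pack(HEADER, r.obj_id, len(label), len(payload))
        stream += label + payload
    return bytes(stream)


def decode_prefix(stream, max_bytes):
    """Decode only the chunks that fit entirely within the first max_bytes."""
    out, pos = [], 0
    header_size = struct.calcsize(HEADER)
    limit = min(len(stream), max_bytes)
    while pos + header_size <= limit:
        obj_id, n_label, n_payload = struct.unpack_from(HEADER, stream, pos)
        body = pos + header_size
        end = body + n_label + n_payload
        if end > limit:
            break  # the next chunk is incomplete within the byte budget
        label = stream[body:body + n_label].decode("utf-8")
        pixels = zlib.decompress(stream[body + n_label:end])
        out.append((obj_id, label, pixels))
        pos = end
    return out
```

Because the most important chunk comes first, a receiver with a limited byte budget can call `decode_prefix(stream, budget)` and still recover the highest-ranked objects, while a receiver with the full stream recovers every region.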

Paper Structure

This paper contains 16 sections, 6 figures.

Figures (6)

  • Figure 1: Illustration of (a) Task-driven ICM framework, (b) Semantically structured image compression with hand-crafted region selection, (c) Our SDComp framework driven by LMM. SDComp employs visual grounding to structure the image by dividing it into distinct regions. These regions are then evaluated for importance using a Large Multimodal Model (LMM). Based on their importance, the regions are encoded and transmitted sequentially.
  • Figure 2: Overall framework of SDComp. (a) Grounded-SAM first extracts object grounding information for an image. (b) Such priors along with the designed prompts and generated captions instruct LMM to rank objects' importance, serving as a basis for (c) Semantically Structured Image Compression.
  • Figure 3: Comparison of detection and visual grounding. Compared with object detection, visual grounding can recognize open-vocabulary categories.
  • Figure 4: Q1: The prompt template for generating short and long captions. Q2: The prompt template for ranking the importance of objects, where Grounding data contains the label, ID, and bounding box information.
  • Figure 5: Performance Comparison of (a) Segmentation on COCO dataset. (b) Detection on COCO dataset. (c) Classification on CUB-200-2011 dataset.
  • ...and 1 more figure