Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

TL;DR

DISCOVER introduces a semantics-driven, versatile image codec that jointly optimizes for human perception and machine-vision tasks by performing encoder-side semantic disentanglement with multimodal LLMs and grounding, followed by diffusion-based semantic composition at decoding. By transmitting only task-relevant latent streams and leveraging diffusion priors, the method achieves substantial bitrate reductions across object detection, segmentation, and classification while maintaining or exceeding human-perception quality, without retraining for new tasks. Two-stage training stabilizes learning and enables partial-region transmission, with quantitative gains including BD-rates of up to -80% on multiple machine-vision tasks and BD-FID improvements for detection, illustrating robust cross-domain applicability. The approach offers practical impact for real-world deployments by reducing bandwidth, enabling adaptive task-aware coding, and leveraging powerful generative priors to reconstruct high-fidelity images.

Abstract

While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized for only one domain. This drawback limits their versatility and generalizability across scenarios and also requires retraining to adapt to new applications, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, to which grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experimental evaluations substantiate the robustness and effectiveness of DISCOVER, demonstrating superior performance in fulfilling the dual objectives of human and machine vision requirements.

Paper Structure

This paper contains 27 sections, 4 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: The codec paradigm of (a) human perception (b) machine vision (c) proposed DISCOVER. It leverages the task-level MLLMs and image-level grounding modules for semantic analysis and disentanglement, transmitting only the bit-streams of task-related elements. At the decoding stage, it incorporates priors from generative models to supplement information, producing high-quality images that simultaneously meet the requirements of human perception and machine vision tasks.
  • Figure 2: The framework of DISCOVER: (1) First, we use an MLLM and a grounding model to perform semantic analysis and extract the location information of task-related objects. (2) The information is then used for semantic disentanglement encoding, transmitting only task-related information. (3) Finally, we leverage the transmitted information and diffusion priors for semantic composition generation.
  • Figure 3: The label and localization generation process of the grounding modules for the vehicle classification task. Unlike general detection, this design can filter out task-related objects (“bicycle”) for the subsequent compression.
  • Figure 4: Visualization of the intermediate process in semantic composition generation. DISCOVER can use the partial task-related compressed latent $\boldsymbol{\tilde{y}}_i$ to generate a high-quality image $\hat{\boldsymbol{x}}$.
  • Figure 5: Machine vision task performance comparison. We use $\diamond$, $\triangledown$, and $\circ$ to represent generative, image-coding-for-machine, and fidelity-based methods, respectively. Our method is represented by $\star$. The proposed versatile method outperforms recent methods across three machine vision tasks without retraining while maintaining satisfactory human perception, as shown in Fig. \ref{fig:human}.
  • ...and 3 more figures
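
The captions for Figures 2 and 4 describe transmitting only the task-related part of the compressed latent. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the grounding model returns pixel-space boxes `(x0, y0, x1, y1)` and that the latent grid is the image downsampled by a fixed `stride` of 16. The stride value and all function names here are hypothetical and chosen only for illustration.

```python
from math import ceil, floor


def boxes_to_latent_mask(boxes, height, width, stride=16):
    """Rasterize pixel-space boxes (x0, y0, x1, y1) onto the latent grid.

    `stride` is an assumed downsampling factor between image and latent.
    """
    rows, cols = ceil(height / stride), ceil(width / stride)
    mask = [[0] * cols for _ in range(rows)]
    for x0, y0, x1, y1 in boxes:
        # Mark every latent cell that overlaps the box (x1, y1 exclusive).
        r0, r1 = floor(y0 / stride), ceil(y1 / stride)
        c0, c1 = floor(x0 / stride), ceil(x1 / stride)
        for r in range(max(r0, 0), min(r1, rows)):
            for c in range(max(c0, 0), min(c1, cols)):
                mask[r][c] = 1
    return mask


def select_task_latents(latent, mask):
    """Keep latent cells under the mask; the rest become None (not sent)."""
    return [
        [value if keep else None for value, keep in zip(lat_row, mask_row)]
        for lat_row, mask_row in zip(latent, mask)
    ]


def transmitted_fraction(mask):
    """Fraction of latent cells that would be transmitted."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


if __name__ == "__main__":
    # A 64x64 image gives a 4x4 latent grid at stride 16.
    mask = boxes_to_latent_mask([(0, 0, 32, 16)], height=64, width=64)
    latent = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(select_task_latents(latent, mask)[0])
    print(transmitted_fraction(mask))
```

In this sketch the cells marked `None` stand for regions that are not transmitted. According to the captions, DISCOVER's decoder fills such regions in using diffusion priors; that step is outside the scope of this example.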