Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task
Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin
TL;DR
DISCOVER introduces a semantics-driven, versatile image codec that jointly optimizes for human perception and machine-vision tasks by performing encoder-side semantic disentanglement with multimodal LLMs and grounding, followed by diffusion-based semantic composition at decoding. By transmitting only task-relevant latent streams and leveraging diffusion priors, the method achieves substantial bitrate reductions across object detection, segmentation, and classification while maintaining or exceeding human-perception quality, without retraining for new tasks. Two-stage training stabilizes learning and enables partial-region transmission, with quantitative gains including BD-rates of up to -80% on multiple machine-vision tasks and BD-FID improvements for detection, illustrating robust cross-domain applicability. The approach offers practical impact for real-world deployments by reducing bandwidth, enabling adaptive task-aware coding, and leveraging powerful generative priors to reconstruct high-fidelity images.
Abstract
While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized for only one domain. This drawback limits their versatility and generalizability across scenarios, and adapting them to new applications requires retraining, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce a semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, and grounding models are then applied to these labels for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, the image is reconstructed by combining these encoded components with priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experiments demonstrate the robustness and effectiveness of DISCOVER, which achieves superior performance on both human and machine vision requirements.
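The task-aware transmission idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Component` type, the `select_streams` helper, the task-to-label mapping, and all bit counts are hypothetical placeholders chosen only to show how sending just the task-relevant component streams reduces the transmitted bits. The generative reconstruction at the decoder is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One disentangled image component (hypothetical structure)."""
    label: str   # semantic label, assumed to come from a multimodal model
    box: tuple   # (x0, y0, x1, y1), assumed to come from a grounding model
    bits: int    # size of this component's encoded stream (placeholder value)


def select_streams(components, wanted_labels):
    """Keep only the components whose label the task asks for."""
    return [c for c in components if c.label in wanted_labels]


def total_bits(components):
    """Sum the stream sizes of the given components."""
    return sum(c.bits for c in components)


# Illustrative components; the numbers are made up, not measured results.
components = [
    Component("person", (10, 20, 110, 220), 4000),
    Component("dog", (150, 120, 260, 230), 2500),
    Component("background", (0, 0, 320, 240), 9000),
]

# Hypothetical mapping from a task to the labels it needs.
task_labels = {
    "detection": {"person", "dog"},
    "human_viewing": {"person", "dog", "background"},
}

sent = select_streams(components, task_labels["detection"])
print(total_bits(sent), "of", total_bits(components), "bits sent")
```

In this toy setting a detection task receives only the object streams, while a human-viewing request receives every stream. In the codec described above, regions that are not transmitted would be filled in at the decoder using generative priors.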
