High Efficiency Image Compression for Large Visual-Language Models

Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye

TL;DR

The paper tackles image compression for machine vision by LVLMs, addressing the mismatch between human-centric codecs and machine-task needs. It introduces a variable bitrate framework composed of a semantically guided pre-editing module and an end-to-end codec trained with token-based losses, including a token-rank term, to preserve task-relevant semantics under bandwidth constraints. The approach uses semantic tokens derived from large models to guide editing and a variational autoencoder with hyperpriors and an adaptive layer to achieve flexible rate-quality trade-offs, formalized through a joint loss $\mathcal{L} = \lambda_R \mathcal{L}_R + \lambda_D \mathcal{L}_D + \lambda_T \mathcal{L}_T$, where the token term is $\mathcal{L}_T = \lambda_{tk} \mathcal{L}_{tk} + \lambda_{rk} \mathcal{L}_{rk}$ and the rank loss $\mathcal{L}_{rk}$ is based on token eigenvalues. Experiments with LVLMs OFA and OP on COCO and RefCOCO/RefCOCO+ demonstrate BD-Rate improvements over the VVC anchor and strong generalization across multimodal tasks, including real-time decoding capability. This work highlights a practical path to machine-oriented, semantic-preserving image coding that aligns with the downstream needs of large multimodal models and could influence future compression strategies for multi-modal data.
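
As a rough illustration of how the joint loss combines its terms, the NumPy sketch below evaluates $\mathcal{L}$ for given rate and pixel-distortion values and two token matrices. Only the weighted-sum structure comes from the summary above. The specific forms are assumptions made for illustration: the token distortion is taken as a mean squared error, and the rank loss as an L1 gap between normalized eigenvalue spectra of the token covariance. All function names and default weights are hypothetical, not the authors' implementation.

```python
import numpy as np


def token_distortion(tok_ref, tok_rec):
    """Assumed form of L_tk: mean squared error between (num_tokens, dim) token matrices."""
    return float(np.mean((tok_ref - tok_rec) ** 2))


def eigen_spectrum(tokens):
    """Eigenvalues of the token covariance, sorted descending and normalized to sum to ~1."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / tokens.shape[0]
    eig = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eig = np.clip(eig, 0.0, None)         # guard against tiny negative round-off
    return eig / (eig.sum() + 1e-12)


def rank_loss(tok_ref, tok_rec):
    """Assumed form of L_rk: L1 gap between the two normalized eigenvalue spectra."""
    return float(np.abs(eigen_spectrum(tok_ref) - eigen_spectrum(tok_rec)).sum())


def joint_loss(rate, distortion, tok_ref, tok_rec,
               lam_r=1.0, lam_d=1.0, lam_t=1.0, lam_tk=1.0, lam_rk=1.0):
    """L = lam_r * L_R + lam_d * L_D + lam_t * (lam_tk * L_tk + lam_rk * L_rk)."""
    l_t = lam_tk * token_distortion(tok_ref, tok_rec) + lam_rk * rank_loss(tok_ref, tok_rec)
    return lam_r * rate + lam_d * distortion + lam_t * l_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(16, 8))                 # stand-in for tokens of the original image
    rec = ref + 0.1 * rng.normal(size=ref.shape)   # stand-in for tokens of the reconstruction
    print(joint_loss(rate=0.5, distortion=0.1, tok_ref=ref, tok_rec=rec))
```

In a training setup the same structure would be written with a differentiable framework so that gradients flow from the token terms back into the pre-editing module and codec; NumPy is used here only to keep the arithmetic visible.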

Abstract

In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, and are thus replacing humans as the receivers of visual information in various application scenarios. In this paper, we pioneer a variable bitrate image compression framework, consisting of a pre-editing module and an end-to-end codec, that achieves promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed around their representation and discrimination capability and measured through token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which improves generalization across various data and tasks. Experimental results demonstrate that the proposed framework achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks reveal the robustness and generalization capability of the proposed framework.

Paper Structure (19 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The paradigm of the proposed image compression scheme for LVLMs, including the encoder, decoder, and loss function design. The pre-editing module includes the semantic tokens extractor and the pre-editing network, and the end-to-end codec is composed of the encoder and decoder for compressing and reconstructing the preprocessed image. The LVLMs are regarded as the ultimate receivers of the reconstructed images.
  • Figure 2: Illustration of the proposed semantic tokens-based pre-editing network. The proposed pre-editing network consists of three parts: semantic token refinement, down-sampling, and up-sampling. The semantic token refinement branch refines the semantic feature representation in different scales. The down-sampling and up-sampling branches utilize semantic tokens at multiple scales.
  • Figure 3: The paradigm of the proposed variable bitrate codec. The codec comprises $g_{enc}$, $g_{dec}$, $h_{enc}$, and $h_{dec}$, each composed of convolutions and compression ratio adaptation layers.
  • Figure 4: Rate-accuracy (RA) performance comparisons with the VVC anchor for image captioning, image-text retrieval, and visual grounding tasks performed with LVLMs.
  • Figure 5: Ablation studies for image captioning and image-text retrieval tasks based on LVLMs.
  • ...and 1 more figure