Bridging Compressed Image Latents and Multimodal Large Language Models

Chia-Hao Kao; Cheng Chien; Yu-Jen Tseng; Yi-Hsin Chen; Alessandro Gnutti; Shao-Yuan Lo; Wen-Hsiao Peng; Riccardo Leonardi

TL;DR

This work tackles the practical challenge of deploying Multimodal Large Language Models on resource-constrained devices by compressing images into latents that are adapted for MLLM vision encoders rather than reconstructing full images. It introduces a lightweight transform-neck and a surrogate loss to bridge compressed latents to the MLLM’s visual encoder, enabling training without back-propagating through the entire MLLM. The method proves broadly applicable across neural codecs and MLLMs, achieving substantial bitrate reductions (up to 60-80% at the same accuracy) and dramatic decoding complexity savings (~95% kMAC/pixel) while maintaining high task performance. This framework facilitates practical, scalable deployment of MLLMs in bandwidth- and compute-constrained settings by shifting the adaptation burden away from the large language model itself toward a compact latent-domain bridge.

Abstract

This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g. images) beyond text, but their billion scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. Given the huge scale of MLLMs, our framework excludes the entire downstream MLLM except part of its visual encoder from training our system. This stands out from most existing coding for machine approaches that involve downstream networks in training and thus could be impractical when the networks are MLLMs. The proposed framework is general in that it is applicable to various MLLMs, neural image codecs, and multiple application scenarios, where the neural image codec can be (1) pre-trained for human perception without updating, (2) fully updated for joint human and machine perception, or (3) fully updated for only machine perception. Extensive experiments on different neural image codecs and various MLLMs show that our method achieves great rate-accuracy performance with much less complexity.
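The training idea in the abstract (update only a small transform-neck while the downstream network stays frozen) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not give the form of the surrogate loss, so a plain mean-squared feature-matching loss stands in for it. The linear `neck`, the fixed matrix `TAIL` (standing in for the retained part of the visual encoder), the helper `train_neck`, and the synthetic latents and targets are all hypothetical and are not the authors' implementation.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def train_neck(latents, targets, tail, steps=500, lr=0.2, seed=0):
    """Fit a linear neck by gradient descent; `tail` is read but never updated."""
    rng = random.Random(seed)
    hidden, width = len(tail[0]), len(latents[0])
    neck = [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden)]
    history = []
    for _ in range(steps):
        grad = [[0.0] * width for _ in range(hidden)]
        total = 0.0
        for y, f in zip(latents, targets):
            h = matvec(neck, y)        # trainable neck: latent -> intermediate feature
            out = matvec(tail, h)      # frozen stand-in for part of the visual encoder
            total += mse(out, f)
            err = [o - t for o, t in zip(out, f)]
            # Gradient flows back through the frozen tail to reach the neck.
            d_h = [2.0 / len(f) * sum(tail[k][i] * err[k] for k in range(len(tail)))
                   for i in range(hidden)]
            for i in range(hidden):
                for j in range(width):
                    grad[i][j] += d_h[i] * y[j]
        n = len(latents)
        for i in range(hidden):
            for j in range(width):
                neck[i][j] -= lr * grad[i][j] / n   # only the neck changes
        history.append(total / n)
    return neck, history

# Hypothetical toy data: targets play the role of features from the original image.
TAIL = [[1.0, 0.5], [0.0, 1.0]]
LATENTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
REFERENCE_NECK = [[0.5, -0.2], [0.3, 0.8]]
TARGETS = [matvec(TAIL, matvec(REFERENCE_NECK, y)) for y in LATENTS]

neck, history = train_neck(LATENTS, TARGETS, TAIL)
print(history[0], history[-1])
```

In this sketch the loss falls as the neck is fitted while `TAIL` is left untouched, which mirrors the abstract's point that nothing beyond part of the visual encoder takes part in training.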

Paper Structure (35 sections, 4 equations, 16 figures, 10 tables)

Introduction
Related Works
Multimodal Large Language Models
Image Coding for Machines
Proposed Method
Preliminaries: Neural Image Codecs
Overall Framework
Transform-neck
Surrogate Loss
Training Procedure
Phase 1: Transform-neck Training
Phase 2: Joint Optimization
Experimental Results
Experimental Setting
Training Details and Datasets.
...and 20 more sections

Figures (16)

Figure 1: On the left are inadequate frameworks for image compression for MLLMs, where the image codec is trained for (a) human perception, (b) the downstream task network, or (c) compressing the intermediate features of the task network. On the right are the proposed transform-neck and surrogate loss under three distinct scenarios, with the image codec (d1) pre-trained for human perception, (d2) updated for joint human and machine perception, or (d3) updated for machine perception.
Figure 2: Overall architecture of the proposed method.
Figure 2: Evaluated tasks with corresponding dataset and MLLM.
Figure 3: Rate-accuracy comparison using various MLLMs on several tasks.
Figure 4: Reconstruction performance comparison on Kodak.
...and 11 more figures
