Prompt-Guided Prefiltering for VLM Image Compression

Bardia Azizian, Ivan V. Bajic

Abstract

The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, so efficient image compression is crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module that identifies the image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec-agnostic and can be applied before both conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25-50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at https://github.com/bardia-az/pgp-vlm-compression.
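The core idea of the abstract, preserving prompt-relevant regions while smoothing the rest before handing the image to an unmodified codec, can be illustrated with a minimal sketch. Here the relevance mask is assumed to be supplied by an upstream prompt-image relevance model (not part of this sketch), and JPEG stands in for an arbitrary codec; the function names `prefilter` and `jpeg_bytes` are illustrative, not the paper's API:

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def prefilter(image: Image.Image, relevance_mask: np.ndarray,
              blur_radius: float = 8.0) -> Image.Image:
    """Smooth regions the mask marks as irrelevant; keep relevant pixels intact.

    relevance_mask: float array in [0, 1] with the same H x W as the image,
    where 1 marks prompt-relevant pixels. In the paper's setting this mask
    would come from a prompt-guided module; here it is assumed given.
    """
    blurred = image.filter(ImageFilter.GaussianBlur(blur_radius))
    mask = relevance_mask[..., None]  # broadcast over the RGB channels
    out = (mask * np.asarray(image, dtype=np.float32)
           + (1.0 - mask) * np.asarray(blurred, dtype=np.float32))
    return Image.fromarray(out.clip(0, 255).astype(np.uint8))


def jpeg_bytes(image: Image.Image, quality: int = 75) -> bytes:
    """Encode with a standard codec (JPEG here) and return the bitstream."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

Because smoothing removes high-frequency content outside the relevant region, the prefiltered image compresses to fewer bytes at the same codec setting, which is the mechanism behind the reported bitrate savings.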

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The overall block diagram of the proposed method.
  • Figure 2: Visualization of the prefiltered versions of a single image generated by our method under three different prompts.
  • Figure 3: Rate–accuracy curves for InternVL3-9B (first row) and LLaVA-OV-7B (second row) across the SEEDBench, MME, and MMBench benchmarks.
  • Figure 4: Rate–accuracy curves for InternVL3-9B on the MME dataset, illustrating the impact of different components of our method.
  • Figure 5: Rate–accuracy curves for InternVL3-9B on the SEEDBench dataset.