Loss-Oriented Ranking for Automated Visual Prompting in LVLMs

Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

TL;DR

This work tackles the challenge of manually crafting visual prompts for LVLMs by introducing AutoV, a retrieval-based framework that automatically picks the best visual prompt from a compact pool conditioned on each image-question pair. AutoV uses a lightweight ranking network that fuses visual-prompt tokens with text, trained with reward-based supervision derived from LVLM prediction losses via automated data generation, and a robust inference pipeline that filters and selects the top prompt. The approach yields consistent, model-agnostic gains across a wide set of LVLM architectures and benchmarks, including notable improvements on VizWiz and MMMU, and demonstrates strong transferability to closed-source models. The results indicate that adaptive, data-driven visual prompting can substantially enhance multimodal reasoning without model fine-tuning, offering a practical path to plug-in improvements for LVLMs in diverse applications.

Abstract

Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to exploit the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from a set of candidates based on the given textual query and the input image. To train AutoV, we develop an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt for LVLMs. Experiments indicate that AutoV enhances the performance of various LVLMs across multiple image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 10.2% accuracy gain on VizWiz, and AutoV boosts Qwen2.5-VL by 3.8% on MMMU, highlighting its potential as an optimal visual prompting method.
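
The labeling step described in the abstract (score each candidate visual prompt by the LVLM's prediction loss, then rank the candidates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `rank_prompts_by_loss`, the prompt names, and the loss values are all hypothetical, and in practice each loss would come from a forward pass of a pre-trained LVLM on the prompted image.

```python
from typing import Dict, List, Tuple


def rank_prompts_by_loss(losses: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (prompt_name, loss) pairs sorted from lowest to highest loss.

    A lower prediction loss is treated as evidence that the visual prompt
    helped the LVLM produce the reference answer.
    """
    return sorted(losses.items(), key=lambda item: item[1])


# Hypothetical losses for one image-question pair (illustrative values only).
example_losses = {
    "blur": 1.42,
    "red_circle": 0.97,
    "attention_mask": 0.63,
    "original": 1.10,
}

ranking = rank_prompts_by_loss(example_losses)
best_prompt = ranking[0][0]  # candidate with the lowest loss
```

The resulting order is what serves as the supervision signal; the ranking network itself never needs the raw loss values at inference time.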

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Benchmark-level comparison of FGVP, Circle, API, and our AutoV. The three existing methods employ heuristic visual prompts: (1) FGVP applies blur effects, (2) Circle highlights foreground objects in red, and (3) API uses attention maps as masks. In contrast, AutoV learns to retrieve the optimal visual prompt using a lightweight ranking network.
  • Figure 2: The illustration of AutoV. It comprises four key components: representation extraction from candidates, ranking network for reward score, reward-supervised training, and inference. It retrieves the most suitable visual prompt tailored to each query-image pair.
  • Figure 3: The reward loss of AutoV. As a concrete example, we illustrate the pairwise combination process using the visual prompts from Figure \ref{fig:main}, demonstrating how our supervision is applied.
  • Figure 4: Retrieval distribution on the MMVet task. The plotted values are the accuracies obtained with each prompt independently.
  • Figure 5: Visualization of the visual prompts retrieved for different queries. The VQA case is sampled from the LLaVA-Wild Bench. $\text{VP}_1$ and $\text{VP}_3$ come from API, $\text{VP}_2$ is from RedCircle, and $\text{VP}_4$ is from FGVP.
  • ...and 3 more figures
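
The Figure 3 caption mentions a pairwise combination of visual prompts for the reward loss. One common way to turn a loss-based ranking into pairwise supervision is a logistic (Bradley-Terry style) ranking loss; the sketch below assumes that form, which the text here does not confirm as the paper's exact objective. The names `preference_pairs` and `pairwise_reward_loss` are hypothetical, `scores` stands in for the ranking network's outputs, and `losses` for the LVLM prediction losses of the same candidates.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def preference_pairs(losses: Sequence[float]) -> List[Tuple[int, int]]:
    """Build (better, worse) index pairs: the lower-loss candidate is preferred."""
    pairs = []
    for i, j in combinations(range(len(losses)), 2):
        if losses[i] < losses[j]:
            pairs.append((i, j))
        elif losses[j] < losses[i]:
            pairs.append((j, i))
        # Ties carry no preference and are skipped.
    return pairs


def pairwise_reward_loss(scores: Sequence[float], losses: Sequence[float]) -> float:
    """Mean of -log(sigmoid(s_better - s_worse)) over all preference pairs.

    Assumed loss form (logistic pairwise ranking), not confirmed by the source.
    """
    pairs = preference_pairs(losses)
    if not pairs:
        return 0.0
    total = 0.0
    for better, worse in pairs:
        margin = scores[better] - scores[worse]
        # softplus(-margin) == -log(sigmoid(margin)), in a numerically stable form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)
```

Under this assumed objective, the loss shrinks as the network assigns higher scores to candidates with lower LVLM loss, and inference reduces to picking the candidate with the highest score.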