ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji

TL;DR

This paper tackles the challenge of enabling region-level referring reasoning in Multimodal Large Language Models without costly retraining. It introduces a training-free method that injects visual prompts by optimizing a learnable latent variable added to visual tokens at test time, guided by energy-based objectives to strengthen attention on referring regions. The approach supports four visual prompt types (box, mask, scribble, point) and demonstrates improved referring performance across ROC/RTC tasks, out-of-domain OCR and screenshot tasks, and across multiple MLLMs, while also enhancing interpretability and potentially reducing hallucinations. By avoiding fine-tuning and leveraging attention dynamics, the method offers a flexible, scalable path to grounded multimodal reasoning with notable out-of-domain robustness. The work provides a practical framework for adding referring capabilities to existing MLLMs with modest computational overhead and clear guidelines for stability via EMA and early stopping.

Abstract

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output at test time, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
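The test-time optimization described in the abstract can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation. It assumes a single text query, a single attention head, an energy of the form (1 - attention mass inside the referring region)^2, and plain gradient descent with a hand-derived gradient instead of backpropagation through a real MLLM. The names `optimize_latent`, `steps`, and `lr` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def optimize_latent(query, visual_tokens, region_mask, steps=5, lr=20.0):
    """Toy test-time optimization of a latent added to visual tokens.

    query:         (d,)   one text prompt token (assumed already projected)
    visual_tokens: (n, d) visual tokens, kept frozen
    region_mask:   (n,)   bool, True for tokens inside the referring region
    Returns the learned latent and the in-region attention mass per step.
    """
    d = query.shape[0]
    m = region_mask.astype(float)
    latent = np.zeros_like(visual_tokens)  # the only variable being optimized
    history = []
    for _ in range(steps):
        logits = (visual_tokens + latent) @ query / np.sqrt(d)
        attn = softmax(logits)
        inside = float(attn[region_mask].sum())
        history.append(inside)
        # Assumed energy: E = (1 - inside)^2.
        # dE/dlogit_j = -2 * (1 - inside) * attn_j * (m_j - inside)
        grad_logits = -2.0 * (1.0 - inside) * attn * (m - inside)
        # logit_j depends on latent_j through query / sqrt(d).
        grad_latent = np.outer(grad_logits, query) / np.sqrt(d)
        latent -= lr * grad_latent
    return latent, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    v = rng.normal(size=(16, 8))
    mask = np.zeros(16, dtype=bool)
    mask[:4] = True  # pretend the first four tokens lie in the region
    _, hist = optimize_latent(q, v, mask)
    print("in-region attention mass per step:", np.round(hist, 3))
```

Only the latent is updated and the model weights stay untouched, which is what makes the approach training-free. The stabilizers mentioned in the TL;DR (EMA and early stopping) are omitted from this sketch.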

Paper Structure (50 sections, 9 equations, 14 figures, 8 tables)


Figures (14)

  • Figure 1: Comparison between the training method and our training-free method. The training method typically requires a large amount of in-domain data for training and cannot generalize to out-of-domain prompts. In contrast, our method can easily adapt to prompts from a new domain in a training-free manner.
  • Figure 2: Attention maps at various layers of the MLLM, with the numbers indicating the respective layer indices. The top row visualizes the attention between the prompt token "hat" and the visual tokens, while the bottom row visualizes the attention between the context token (introduced in Sec. \ref{sec:ene}) and the visual tokens.
  • Figure 3: Manipulating attention with various methods. (a), (b), and (c) manipulate the attention map by adding an adjustment coefficient $\eta$ to it at the first step of model inference. (d) illustrates the step-by-step editing approach. (e) shows the result of optimizing learnable context tokens (introduced in Sec. \ref{sec:ene}) instead of visual tokens, while (f) presents the result of our method, which optimizes the learnable latent variable.
  • Figure 4: Overview of our method. Given a visual prompt, we convert it into a mask and compute the mask-based energy function between the mask and the pooled attention map. During inference, we backpropagate to optimize a learnable latent variable. This process is executed at the 0-th step of model inference and iterated $T$ times.
  • Figure 5: Examples of referring with an MLLM using four types of visual prompts: box (a), mask (b), scribble (c), and point (d). Correct referring expressions are marked in green, incorrect ones in red, and hallucinated ones in orange. Compared to the baseline model, our method enhances interpretability and controllability with visual prompts, while also helping the model mitigate hallucination issues.
  • ...and 9 more figures
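
The Figure 4 caption says the visual prompt is first converted into a mask that is compared against the pooled attention map. The sketch below shows one plausible way to do that conversion on the visual-token grid. It is an illustration under assumptions, not the paper's code: the helper names `box_to_mask` and `point_to_soft_mask`, the cell-center test for boxes, and the Gaussian falloff with width `sigma` for points are all hypothetical choices.

```python
import numpy as np


def _cell_centers(image_size, grid_size):
    """Pixel coordinates of the center of every visual-token cell."""
    width, height = image_size
    grid_h, grid_w = grid_size
    cx = (np.arange(grid_w) + 0.5) * width / grid_w
    cy = (np.arange(grid_h) + 0.5) * height / grid_h
    return cx, cy


def box_to_mask(box, image_size, grid_size):
    """Binary (grid_h, grid_w) mask: True where the cell center is in the box.

    box is (x0, y0, x1, y1) in pixels; image_size is (width, height).
    """
    x0, y0, x1, y1 = box
    cx, cy = _cell_centers(image_size, grid_size)
    inside_x = (cx >= x0) & (cx <= x1)
    inside_y = (cy >= y0) & (cy <= y1)
    return inside_y[:, None] & inside_x[None, :]


def point_to_soft_mask(point, image_size, grid_size, sigma=0.1):
    """Soft (grid_h, grid_w) mask that decays with distance from a point.

    Distances are normalized by the image size; sigma is an assumed
    falloff width in those normalized units.
    """
    px, py = point
    width, height = image_size
    cx, cy = _cell_centers(image_size, grid_size)
    dx = (cx - px) / width
    dy = (cy - py) / height
    dist2 = dy[:, None] ** 2 + dx[None, :] ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))


if __name__ == "__main__":
    hard = box_to_mask((0, 0, 112, 112), image_size=(224, 224), grid_size=(4, 4))
    soft = point_to_soft_mask((28, 28), image_size=(224, 224), grid_size=(4, 4))
    print(hard.astype(int))
    print(np.round(soft, 3))
```

A mask or scribble prompt could be handled the same way by downsampling it to the token grid. Either kind of mask can then be flattened and used to weight the attention map inside an energy function.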