Test-Time Computing for Referring Multimodal Large Language Models

Mingrui Wu; Hao Chen; Jiayi Ji; Xiaoshuai Sun; Zhiyuan Liu; Liujuan Cao; Ming-Ming Cheng; Rongrong Ji

Abstract

We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
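The test-time optimization described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the frozen model is replaced by a single dot-product attention between one text query and a handful of visual tokens, the energy is assumed to be the squared shortfall of attention mass inside the referred region, and all names (`attention`, `energy`, `optimize_latent`) and the step size are hypothetical. Only the latent offset added to the visual tokens is updated; the query and the visual tokens stay fixed.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention(query, visual_tokens, latent):
    """Attention of one text query over visual tokens shifted by the latent offset."""
    logits = [
        sum(q * (v + p) for q, v, p in zip(query, tok, lat))
        for tok, lat in zip(visual_tokens, latent)
    ]
    return softmax(logits)

def energy(attn, mask):
    """Assumed energy: squared attention mass that falls outside the mask."""
    inside = sum(a for a, m in zip(attn, mask) if m)
    return (1.0 - inside) ** 2

def optimize_latent(query, visual_tokens, mask, steps=50, lr=1.0):
    """Gradient descent on the latent only; query and visual tokens are frozen."""
    latent = [[0.0] * len(query) for _ in visual_tokens]
    history = []
    for _ in range(steps):
        attn = attention(query, visual_tokens, latent)
        inside = sum(a for a, m in zip(attn, mask) if m)
        history.append((1.0 - inside) ** 2)
        for j, (a, m) in enumerate(zip(attn, mask)):
            # dE/dlogit_j = -2 (1 - inside) * a_j * (m_j - inside);
            # dlogit_j/dlatent_j = query, so the chain rule gives coeff * query.
            coeff = -2.0 * (1.0 - inside) * a * ((1.0 if m else 0.0) - inside)
            latent[j] = [p - lr * coeff * q for p, q in zip(latent[j], query)]
    return latent, history

# Toy data: four visual tokens, the user refers to the third one.
query = [1.0, 0.5]
visual_tokens = [[0.2, 0.1], [0.9, 0.3], [0.1, 0.4], [0.5, 0.5]]
mask = [False, False, True, False]

latent, history = optimize_latent(query, visual_tokens, mask)
final_attn = attention(query, visual_tokens, latent)
```

In this toy setting the energy falls as the latent is updated and the attention mass shifts toward the masked token, which is the behaviour the abstract attributes to steering attention toward user-specified areas. A real MLLM would obtain the gradient by backpropagation through the frozen network rather than from a hand-derived formula.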

Paper Structure (35 sections, 9 equations, 10 figures, 7 tables)

Introduction
Related Work
MLLMs
Referring MLLMs
Training-free Control in Text-to-Image
Visual Prompt
Background
Multimodal Large Language Models (MLLMs)
Referring in MLLMs
Training-based Approaches
Training-Free Referring
Method
Analysis of the Attention in LVLMs
ControlMLLM: Manipulating Attention via Latent Variable Learning
Hard Mask-based Energy Function
...and 20 more sections

Figures (10)

Figure 1: Comparison between the training method and our test-time computing method. The training method typically requires a large amount of in-domain data for training and cannot generalize to out-of-domain prompts. In contrast, our method can easily adapt to prompts from a new domain in a test-time computing manner.
Figure 2: Attention maps at various layers of the MLLM; the numbers indicate the layer indices. The top row visualizes the attention between the prompt token "hat" and the visual tokens, while the bottom row visualizes the attention between the context token (introduced in Sec. \ref{sec:ene}) and the visual tokens.
Figure 3: Manipulating attention with various methods. (a), (b), and (c) manipulate the attention map by adding an adjustment coefficient $\eta$ to it at the first step of model inference. (d) illustrates the step-by-step editing approach. (e) shows the result of optimizing learnable context tokens (introduced in Sec. \ref{sec:ene}) instead of visual tokens, while (f) shows the result of our method, which optimizes the learnable latent variable.
Figure 4: Overview of ControlMLLM. We convert the provided visual prompt into a mask and compute the mask-based energy function between the mask and the pooled attention map. During inference, we backpropagate this energy to optimize a learnable latent variable. This optimization is executed at the 0-th step of model inference and iterated $T$ times.
Figure 5: The text tokens' contribution to attention maps, where the attention is focused on the answer-start token ($|$START$|$).
...and 5 more figures
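
The first two stages named in the Figure 4 caption, converting a visual prompt into a mask and scoring it against a pooled attention map, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the box is rasterised by testing cell centres on a square token grid, pooling is assumed to be a plain element-wise mean over several attention maps, the energy is assumed to be the squared fraction of attention outside the mask, and the names `box_to_mask`, `pool_attention`, and `hard_mask_energy` are hypothetical.

```python
def box_to_mask(box, image_size, grid_size):
    """Rasterise a pixel-space box (x0, y0, x1, y1) onto a flattened token grid."""
    x0, y0, x1, y1 = box
    width, height = image_size
    mask = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Centre of this visual-token cell in pixel coordinates.
            cx = (col + 0.5) * width / grid_size
            cy = (row + 0.5) * height / grid_size
            mask.append(x0 <= cx <= x1 and y0 <= cy <= y1)
    return mask

def pool_attention(attn_maps):
    """Element-wise mean of several flattened attention maps (assumed pooling)."""
    n = len(attn_maps)
    return [sum(vals) / n for vals in zip(*attn_maps)]

def hard_mask_energy(attn, mask):
    """Assumed energy: squared fraction of attention that lies outside the mask."""
    total = sum(attn)
    inside = sum(a for a, m in zip(attn, mask) if m)
    return (1.0 - inside / total) ** 2

# A box covering the top-left quarter of a 100x100 image on a 4x4 token grid.
mask = box_to_mask((0, 0, 50, 50), (100, 100), 4)

# Two uniform attention maps stand in for maps taken from different layers.
pooled = pool_attention([[1.0] * 16, [3.0] * 16])
uniform_energy = hard_mask_energy(pooled, mask)

# Attention that already sits entirely inside the box gives zero energy.
focused = [1.0 if m else 0.0 for m in mask]
focused_energy = hard_mask_energy(focused, mask)
```

Points, scribbles, and free-form masks could be mapped onto the same token grid in a similar way, after which the same energy applies unchanged; how the paper handles each prompt type is not specified in the captions above.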
