Test-Time Warmup for Multimodal Large Language Models

Nikita Rajaneesh, Thomas Zollo, Richard Zemel

TL;DR

The paper tackles the data bottleneck in multimodal large language models (MLLMs) by introducing Test-Time Warmup (TTW), a per-instance adaptation that uses weakly supervised auxiliary tasks to refine image representations at inference time without permanently changing the model's parameters. TTW generates multiple caption-like outputs per auxiliary task, filters them with CLIP, and performs gradient updates on the LLM and connector while keeping the vision encoder fixed; the updates are discarded once the test instance has been answered. Empirically, TTW yields relative improvements of 4.03% on MMMU, 1.63% on GQA, and 5.28% on VQA-Rad, and shows modest gains on VQAv2, indicating enhanced perceptual reasoning without requiring new labels. The work also discusses ablations, limitations, and future directions, including LoRA, GRPO, data-driven auxiliary task selection, and safety considerations.

Abstract

Multimodal Large Language Models (MLLMs) hold great promise for advanced reasoning at the intersection of text and images, yet they have not fully realized this potential. MLLMs typically integrate an LLM, a vision encoder, and a connector that maps the vision encoder's embeddings into the LLM's text embedding space. Although each component is pretrained on massive datasets with billions of samples, the entire multimodal model is typically trained on only thousands (or a few million) samples, which can result in weak performance on complex reasoning tasks. To address these shortcomings, instead of relying on extensive labeled datasets for fine-tuning, we propose a Test-Time Warmup method that adapts the MLLM per test instance by leveraging data from weakly supervised auxiliary tasks. With our approach, we observe a relative performance improvement of 4.03% on MMMU, 5.28% on VQA-Rad, and 1.63% on GQA on the Llama-Vision-Instruct model. Our method demonstrates that 'warming up' before inference can enhance MLLMs' robustness across diverse reasoning tasks.

Paper Structure

This paper contains 40 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An example of Test-Time Warmup improving a model's attention to detail (a crucial aspect of perceptual reasoning). In step (1), for each auxiliary task prompt listed in Figure 2, we generate 10 caption-like outputs; this example shows the outputs generated for the object detection auxiliary task. We then keep the caption with the maximum CLIP score. Here, the more detailed caption is chosen because it is better aligned with the image. In step (2), we perform gradient steps on the chosen caption-like outputs for the $N$ auxiliary task prompts ($N=10$ as shown in Figure 2), which pushes the model to attend to all the objects in the image, including the mask. In step (3), at inference, the "warmed-up" MLLM is better equipped to answer the question because it has attended to the mask.
  • Figure 2: Each prompt represents a unique auxiliary task because the prompts are designed to elicit different kinds of information in the image from the MLLM. These auxiliary tasks are not specific to any downstream target task and they aim to refine the target image's representations in the MLLM.
  • Figure 3: GQA examples where our "warmed-up" MLLM answers correctly while the base MLLM fails. Below each example we show the auxiliary prompt response that was most relevant to the question and may have helped the model reason better. For the leftmost image, which is crowded with objects and requires fine-grained attention, the auxiliary response adds scene context that helps the model infer that there are no small spoons. In the centre and rightmost images, the auxiliary response explicitly provides the information needed to answer the question, directly steering the model toward the correct answer.
  • Figure 4: VQA-Rad examples where our "warmed-up" MLLM answers correctly while the base MLLM fails. Below each example we show the auxiliary prompt response that was most relevant to the question and may have helped the model reason better. For the leftmost and centre images, the auxiliary response delivers information directly needed to answer the test question. Although the auxiliary response for the rightmost image is not directly tied to the question, including it in Test-Time Warmup compels the model to examine the image carefully beforehand, helping it avoid hallucinating findings that are not present.
  • Figure 5: This figure motivates using Test-Time Warmup for AI safety in MLLMs. Both prompts shown in the figure are offensive only when paired with the image on the left, because they prompt the model to make assumptions about traditional attire. For the image of children in Halloween costumes, the first question would be appropriate, and for the image in the bottom right, the second question would be appropriate because the people shown are celebrating Holi, a religious event.
  • ...and 5 more figures