TTRV: Test-Time Reinforcement Learning for Vision Language Models

Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza

TL;DR

TTRV introduces the first test-time reinforcement learning framework for vision-language models, enabling unsupervised online adaptation by deriving rewards directly from unlabeled test data. By combining a frequency-based reward with entropy-based diversity control within Group Relative Policy Optimization, it achieves robust, data-efficient improvements across 16 benchmarks for object recognition and VQA, often rivaling strong proprietary models. The approach demonstrates strong cross-dataset generalization and remains effective even in extremely data-scarce settings, highlighting the potential of test-time RL to bridge pre-training and deployment without labeled feedback.

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.

Paper Structure

This paper contains 30 sections, 9 equations, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Test-Time RL for VLMs. (left) Unlike prior methods that require pre-training splits and post-training via Supervised Finetuning (SFT) or Reinforcement Learning (RL), our approach extracts reward signals directly at test time from unlabeled data. The reward combines (1) frequency-based signals and (2) diversity control, allowing the model to adapt online and improve downstream vision performance without any labeled data. (right) Test accuracy increases while entropy of the output logits decreases, showing that the model becomes more accurate and less uncertain as test-time RL progresses. The solid lines represent the mean, and shaded regions represent the variance of results obtained over $5$ independent runs. The dataset is RESISC45, the task is object recognition, and the model is InternVL3-2B.
  • Figure 2: Overview of TTRV. For each prompt $x$, the VLM generates $N$ candidate responses $\{\hat{y}_1, \ldots, \hat{y}_N\}$ from its policy $\pi_\theta(\cdot|x)$. These samples induce an empirical distribution over the unique outputs $\{\tilde{y}_1, \ldots, \tilde{y}_M\}$, from which two reward signals are derived: (i) a frequency-based reward, where each response $\hat{y}_j$ is rewarded in proportion to how often its output occurs among the $N$ responses (i.e., its empirical probability in the distribution), and (ii) a diversity control reward, computed from the distribution to regulate diversity and encourage convergence. The final reward is the weighted combination of these terms, which is used to update the policy via GRPO.
  • Figure 3: Cross-dataset Generalization. Top-1 accuracy (%) achieved by employing TTRV on a base dataset using InternVL3-2B and evaluating on a target dataset from a completely different domain. The results highlight that TTRV enhances core abilities of the model.
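
The reward construction described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ttrv_rewards`, the weight `alpha` and its default value, the use of the natural logarithm, and the exact way the two terms are combined (frequency minus weighted entropy) are all hypothetical choices made here for clarity. The GRPO policy update itself is omitted.

```python
import math
from collections import Counter


def ttrv_rewards(responses, alpha=0.5):
    """Sketch of a frequency reward plus an entropy-based diversity term.

    `responses` holds the N sampled outputs for one prompt (already parsed
    to comparable answers, e.g. class labels). `alpha` is a hypothetical
    weight on the entropy term; its value is an assumption.
    Returns (per-response rewards, entropy of the empirical distribution).
    """
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one sampled response")

    # Empirical distribution over the M unique outputs.
    counts = Counter(responses)
    probs = {y: c / n for y, c in counts.items()}

    # Shannon entropy (in nats) of that empirical distribution.
    entropy = -sum(p * math.log(p) for p in probs.values())

    # Frequency reward: each response receives the empirical probability
    # of its own output. The entropy term is subtracted so that a more
    # concentrated (lower-entropy) distribution yields a higher reward.
    rewards = [probs[y] - alpha * entropy for y in responses]
    return rewards, entropy


if __name__ == "__main__":
    samples = ["airport", "airport", "runway", "airport"]
    rewards, entropy = ttrv_rewards(samples)
    for y, r in zip(samples, rewards):
        print(f"{y:8s} reward={r:.3f}")
    print(f"entropy={entropy:.3f}")
```

In this sketch the majority output receives a larger reward than the minority output, and a group in which all samples agree has zero entropy, so each response receives the maximum frequency reward of 1. Note that the entropy term here is identical for every response to a given prompt; how such a shared term interacts with GRPO's group-relative normalization depends on details not covered in this summary.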