TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne

Abstract

Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of test-time reinforcement learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video (TTA-Vid) combines two components that work simultaneously: (1) test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets, using a batch-aware, frequency-based reward computed across the subsets as pseudo ground truth to update the model; and (2) a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. We show that a model adapted on a single batch, or even a single sample, from a dataset generalizes at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms current state-of-the-art methods trained on large-scale data, highlighting the potential of test-time reinforcement learning for temporal multimodal understanding.
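As a rough illustration of the frequency-based reward described above, the answers generated from different frame subsets can be reduced to a majority vote: the most frequent answer serves as the pseudo label, and each sampled answer is rewarded for agreeing with it. The sketch below is our own minimal reading of that idea (the function name and the binary 0/1 reward are our assumptions, not the paper's exact formulation).

```python
from collections import Counter

def majority_reward(answers):
    """Frequency-based pseudo-labeling: `answers` holds one predicted answer
    per sampled frame subset for the same question. The most frequent answer
    is treated as the pseudo ground truth, and each answer receives a binary
    reward for agreeing with it (a hypothetical simplification)."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Four frame subsets yield four candidate answers; "B" wins the vote.
label, rewards = majority_reward(["B", "B", "C", "B"])
# label == "B"; rewards == [1.0, 1.0, 0.0, 1.0]
```

In the full method this reward would drive a reinforcement-learning update of the model parameters; here it only shows how agreement across frame subsets substitutes for ground-truth labels.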

Paper Structure

This paper contains 36 sections, 8 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: TTA-Vid adapts vision-language models at test time by sampling multiple frame subsets, enforcing majority consistency among generated answers, and updating a frame-importance distribution via a multi-armed bandit. This enables label-free test-time adaptation while selecting the frames most relevant for reasoning.
  • Figure 2: Overview of TTA-Vid: Our method performs test-time adaptation of model parameters through a batch-aware reinforcement learning objective and adaptively selects the most informative frames using a multi-armed bandit approach. Both components leverage a shared reward signal computed across multiple video frame subsets, enabling the model to jointly learn what to predict and which frames to attend to.
  • Figure 3: Qualitative comparison of frame selection strategies: random sampling (left) versus our learned selection (right) on two VideoMMMU (Perception) examples. (1): A music question that requires the model to identify the sequence of notes displayed at a certain timestamp in the video; our method selects the critical frame (correct answer I), while random sampling misses it (predicts G). (2): An accounting question requiring localization of a value; our method identifies the value highlighted in blue as 0.58 (option G), while random sampling fails (predicts 2.5, option E).
  • Figure 4: Performance per category on LongVideoBench. Orange and blue dotted lines correspond to the overall performance of the base model and TTA-Vid, respectively. The LongVideoBench dataset covers multiple categories; the highest gains over the base model are observed for Computer Science, STEM, and News programs.
  • Figure 5: Performance by video duration on LongVideoBench. Orange and blue dotted lines correspond to the overall performance of the base model and TTA-Vid, respectively. The highest gains are observed for videos of 1-3 minutes in duration. TTA-Vid also works well on very long videos, i.e., over 30 minutes.
  • ...and 3 more figures
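Figures 1 and 2 describe a multi-armed bandit that maintains a frame-importance distribution updated from the shared reward. One simple way to picture this is an epsilon-greedy bandit over frame indices, where each frame is an arm whose value estimate tracks the rewards of the subsets it appeared in. The class below is a hypothetical sketch under that reading; the class name, epsilon-greedy rule, and learning rate are our illustrative assumptions, not the paper's exact algorithm.

```python
import random

class FrameBandit:
    """Toy multi-armed bandit over frame indices: each frame is an arm whose
    value estimate is nudged toward the reward of every subset it is part of,
    so frames that repeatedly appear in high-reward subsets are preferred."""

    def __init__(self, num_frames, epsilon=0.2, lr=0.1):
        self.values = [0.0] * num_frames  # per-frame importance estimates
        self.epsilon = epsilon            # exploration probability
        self.lr = lr                      # step size for value updates

    def select_subset(self, k):
        # With probability epsilon, explore a random subset of k frames;
        # otherwise exploit the k frames with the highest value estimates.
        if random.random() < self.epsilon:
            return random.sample(range(len(self.values)), k)
        ranked = sorted(range(len(self.values)),
                        key=lambda i: self.values[i], reverse=True)
        return ranked[:k]

    def update(self, subset, reward):
        # Move each selected frame's estimate toward the observed reward
        # (e.g., the majority-consistency reward of the answer it produced).
        for i in subset:
            self.values[i] += self.lr * (reward - self.values[i])
```

For example, after rewarding frames 1 and 3 with 1.0 and frame 0 with 0.0, a purely greedy bandit (`epsilon=0.0`) would select frames 1 and 3 for a subset of size two.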