STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

Abstract

We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest) demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
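The importance-aware sampler described above must balance question relevance against temporal coverage. Since this page does not spell out the exact procedure, the following is a minimal Python sketch of one plausible instantiation, assuming per-frame relevance scores (e.g., CLIP-style question-frame similarities) are supplied; the function name and parameters (`importance_aware_sample`, `num_bins`, `temperature`) are illustrative, not the paper's API.

```python
import numpy as np

def importance_aware_sample(relevance, num_frames, num_bins=8,
                            temperature=1.0, seed=None):
    """Pick `num_frames` frame indices, biased toward question-relevant
    frames while preserving temporal coverage via stratified bins.

    relevance: 1D array of per-frame question-relevance scores
    (a hypothetical input, e.g. CLIP text-frame similarities).
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(relevance, dtype=float)
    bins = np.array_split(np.arange(len(scores)), num_bins)  # contiguous temporal bins
    quota = [num_frames // num_bins] * num_bins
    for k in range(num_frames % num_bins):                    # spread the remainder
        quota[k] += 1

    chosen = []
    for idx, n in zip(bins, quota):
        n = min(n, len(idx))
        if n == 0:
            continue
        logits = scores[idx] / temperature                    # softmax within the bin
        p = np.exp(logits - logits.max())
        p /= p.sum()
        chosen.extend(rng.choice(idx, size=n, replace=False, p=p))
    return np.sort(np.array(chosen))

# e.g., 16 frames from a 300-frame clip whose relevance peaks mid-video
scores = np.exp(-((np.arange(300) - 150) ** 2) / 2000.0)
print(importance_aware_sample(scores, num_frames=16))
```

Stratified temporal bins guarantee coverage of the whole clip, while the within-bin softmax biases selection toward question-relevant frames; the paper's actual mechanism may combine these two criteria differently.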

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: STRIVE prevents advantage collapse in video reinforcement learning. (a) Standard policy optimization samples multiple textual responses from a single, fixed video representation. When these generated responses lack diversity, the reward variance can drop to zero ($\text{Var}(r) = 0$), stalling the gradient of the objective $J(\theta)$. (b) Our proposed STRIVE framework employs Variant Construction to systematically generate multiple spatiotemporal visual variants of the input video. By computing joint advantages across both textual generations and varied visual perspectives, STRIVE empirically maintains a consistently positive reward variance ($\text{Var}(r) > 0$), so the objective $J(\theta)$ keeps delivering informative policy updates. A minimal numeric sketch of this collapse is given after the figure list.
  • Figure 2: Overview of our STRIVE Framework. Given a video input $v$ and question $q$, a set of transformations $\{\mathcal{T}_i\}_{i=1}^M$ is applied to generate video variants $\{\tilde{v}_i\}_{i=1}^M$. Each variant is paired with the question to form input tuples $(\tilde{v}_i, q)$. The policy model produces $G$ textual responses for each of the $M$ variants, resulting in a $G \times M$ matrix of responses $\{o_{i,j}\}_{i=1,j=1}^{G,M}$. These are evaluated by a reward model to obtain $\{r_{i,j}\}_{i=1,j=1}^{G,M}$. Advantages are then jointly normalized across the entire $G \times M$ pool, enriching reward variance. A corresponding sketch of the joint normalization follows the figure list.
  • Figure 3: Empirical analysis of optimization dynamics. (a) Standard deviation of rewards, demonstrating that STRIVE maintains consistently higher variance than the GRPO baseline. (b) Fraction of zero-advantage updates, illustrating STRIVE's ability to prevent gradient starvation. (c) Gradient norm over training steps. Semi-transparent lines show raw per-step gradient values, highlighting the severe optimization instability and chaotic spikes of standard GRPO, while bold solid lines denote the smoothed trajectories.
  • Figure 4: Qualitative comparison between STRIVE and the standard GRPO baseline. (Top) STRIVE successfully isolates the relevant driving frames to correctly answer a question about a person's mode of transport, whereas GRPO is distracted by irrelevant walking scenes. (Bottom) STRIVE accurately determines the strict temporal sequence of objects, demonstrating how question-guided variant construction improves fine-grained spatiotemporal reasoning and prevents reliance on flawed priors.
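For context on the failure mode in Figure 1(a): GRPO-style methods compute each response's advantage by normalizing its reward against the group mean and standard deviation, so a zero-variance group yields all-zero advantages and no gradient. A minimal numeric illustration (the `eps` guard is a common implementation convention, assumed here rather than taken from the paper):

```python
import numpy as np

def group_advantage(rewards, eps=1e-8):
    # group-normalized advantage used by GRPO-style objectives
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# All G responses earn the same reward -> Var(r) = 0, so every
# advantage is zero and the policy gradient stalls.
print(group_advantage([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.]

# Mixed correctness restores a usable learning signal.
print(group_advantage([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```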
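The joint normalization in Figure 2 can be sketched in the same style: the $G \times M$ reward matrix is pooled and normalized once, so disagreement across visual variants keeps the variance positive even when each variant's $G$ responses agree. Shapes follow the caption; the function name and `eps` guard are illustrative assumptions:

```python
import numpy as np

def joint_advantages(rewards, eps=1e-8):
    """rewards: (G, M) matrix, r[i, j] = reward of response i under
    video variant j. Normalizing over the full G x M pool (rather than
    per-variant groups) lets cross-variant disagreement carry signal."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Per-variant normalization would yield all-zero advantages here
# (each column has zero variance), but the joint pool does not.
r = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0]])  # G = 3 responses, M = 2 variants
print(joint_advantages(r))  # column 0 -> +1, column 1 -> -1
```

Under per-variant normalization both columns would produce zero advantages; pooling recovers a signal in exactly the regime that Figure 1 identifies as problematic.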