ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass

TL;DR

ROVER tackles long-horizon video reasoning in embodied tasks by recursively decomposing a task into subtasks and reasoning over each subtask with a temporally localized sliding context window. The approach maintains global task context while constraining the model's reasoning to a small, relevant frame window, which reduces hallucinations and makes inference cost scale linearly with video length. A RoboCasa-derived dataset of expert and perturbed trajectories with ground-truth progress signals enables evaluation on frame-level progress estimation, frame-level natural language reasoning, and video QA, where ROVER consistently outperforms strong baselines. The work also assesses robustness across camera views, backbone models, and real-world OpenX Embodiment data, underscoring ROVER's practical applicability for scalable embodied reasoning with vision-language models.

Abstract

Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: https://rover-vlm.github.io
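The linear-scaling claim can be illustrated with a small counting sketch. It is not the authors' implementation: it assumes a baseline that re-reads every frame seen so far at each timestep, and it stands in for ROVER's subtask-specific window with a fixed cap (`window`, a hypothetical parameter). The function names are illustrative only.

```python
def frames_full_history(num_frames: int) -> int:
    """Total frames read when timestep t attends to frames 1..t (assumed baseline)."""
    return sum(t for t in range(1, num_frames + 1))


def frames_sliding_window(num_frames: int, window: int) -> int:
    """Total frames read when timestep t attends to at most `window` recent frames.

    `window` is a hypothetical fixed cap used here as a simplification of a
    subtask-specific sliding context window.
    """
    return sum(min(t, window) for t in range(1, num_frames + 1))


if __name__ == "__main__":
    for n in (100, 200, 400):
        full = frames_full_history(n)
        windowed = frames_sliding_window(n, window=10)
        print(f"T={n}: full history={full}, sliding window={windowed}")
```

Under these assumptions, doubling the video length roughly quadruples the frames read by the full-history baseline (quadratic growth), while the windowed count roughly doubles (linear growth).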

Paper Structure

This paper contains 47 sections, 6 equations, 30 figures, 14 tables, 1 algorithm.

Figures (30)

  • Figure 1: ROVER is a recursive framework for reasoning over camera video that decomposes a task into subtasks to maintain a compact temporal context, improving reasoning accuracy and efficiency.
  • Figure 2: Generating diverse video trajectories by inserting random deviations into an expert path.
  • Figure 3: Mean and standard error of correlation between ground-truth values and progress values predicted for all videos (a) and stratified across trajectory level (b). Highest-level videos for each task show near-expert completion. Amount of non-expert behavior increases as level decreases.
  • Figure 4: ROVER exhibits more accurate reasoning and progress prediction during non-expert states.
  • Figure 5: Mean and standard error of reasoning error rate (percentage of frames in which the model states something that is verifiably wrong) for all videos (a) and stratified across trajectory level (b).
  • ...and 25 more figures