VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mohamed Eltahir; Ali Habibullah; Yazan Alshoibi; Lama Ayash; Tanveer Hussain; Naeemullah Khan

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To address both, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is simultaneously lossless, navigable, scalable, and free of captioning and preprocessing. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Formulating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture in which a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.

Paper Structure (49 sections, 9 figures, 8 tables)

Introduction
From representation to reasoning.
Related Work
Long-Form Video Understanding.
Caption-Based Approaches.
Agentic, Hierarchical, and Memory Approaches.
Long Context as the Core Challenge.
Environment Budgeting vs. Prior Compute Adaptation.
What Is Missing?
Methodology
VideoAtlas
Hierarchical Grid.
Action Space.
Memory.
Formal Environment Definition.
...and 34 more sections

Figures (9)

Figure 1: Logarithmic compute scaling with video duration. Video-RLM's hierarchical grid grows sub-linearly ($O(\log T)$), requiring up to 9.7$\times$ fewer tokens than linear-scaling baselines. A uniform VLM maxes out its 256K context, trading off sampled frame count against resolution.
Figure 2: The VideoAtlas Environment. (Left) The state space is a hierarchical grid stack $S_0, S_1, \ldots, S_D$, where $S_0$ is the root grid covering the entire video of duration $T$. Each grid has $K^2$ cells. Deeper levels $d$ provide finer temporal resolution $\Delta t_d = T/K^{2(d+1)}$. (Top Right) The discrete action space $\mathcal{A}$ is divided into navigation (e.g., Expand to $S_{t+1}$), perception, and commit actions. (Bottom Right) The visual scratchpad memory $\mathcal{M}^+$ accumulates multimodal evidence (images, timestamps, QA pairs) across exploration rounds.
Figure 3: Video-RLM overview. The query is converted into a search task. In each round $r$, the Master examines the root grid $S_0$ (with dead zones masked) and the scratchpad $\mathcal{M}^+$, then assigns promising cells to Workers. Each Worker autonomously explores its assigned region via navigation, perception, and commit actions. After all Workers return, $\mathcal{M}^+$ and $\mathcal{M}^-$ are updated. The Master performs an uncertainty analysis: if evidence is sufficient, the final answer is produced. Otherwise, a new round begins.
Figure 4: (a) Environment budgeting: accuracy and tokens vs. max depth on a subset of LVB-10hr (temporal span annotated). Green: optimal depth (first sub-second layer). (b) Adaptive compute: average tokens scale with evidence spread without ground-truth supervision.
Figure 5: Wall-clock time (normalized to equal workload) vs. number of workers, on 30 questions sampled from LVB-10hr. Accuracy (annotated) remains stable across all configurations.
...and 4 more figures
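
As an illustration of the temporal resolution formula in the Figure 2 caption, $\Delta t_d = T/K^{2(d+1)}$, the following minimal Python sketch computes the span of one grid cell at depth $d$ and the smallest depth at which cells become sub-second. The function names and the grid side K = 8 are assumptions chosen for illustration; they are not taken from the paper, and the sketch is not the authors' implementation.

```python
def cell_duration(total_seconds: float, k: int, depth: int) -> float:
    """Temporal span of one cell at the given depth: T / K^(2(d+1))."""
    return total_seconds / (k ** (2 * (depth + 1)))


def depth_for_resolution(total_seconds: float, k: int, target_seconds: float) -> int:
    """Smallest depth d whose cells each span at most target_seconds."""
    depth = 0
    while cell_duration(total_seconds, k, depth) > target_seconds:
        depth += 1
    return depth


if __name__ == "__main__":
    K = 8  # hypothetical grid side (K^2 = 64 cells per grid); an assumption
    for hours in (1, 10, 100):
        T = hours * 3600
        d = depth_for_resolution(T, K, target_seconds=1.0)
        print(f"{hours:>3} h video: sub-second cells first reached at depth {d}")
```

Under these assumed values, each tenfold increase in duration adds one level to the depth needed for sub-second cells, which is the logarithmic growth in access depth that the abstract describes.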
