Video Understanding: From Geometry and Semantics to Unified Models

Zhaochong An; Zirui Li; Mingqiao Ye; Feng Qiao; Jiaang Li; Zongwei Wu; Vishal Thengane; Chengzu Li; Lei Li; Luc Van Gool; Guolei Sun; Serge Belongie

Abstract

Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.

Paper Structure (51 sections, 8 figures, 9 tables)

Introduction
Low-level video geometry understanding.
High-level semantic understanding.
Unified video understanding models.
Survey structure.
Low-level Video Geometry Understanding
Video depth estimation.
Inference-time alignment.
Feed-forward prediction.
Diffusion-based approaches.
Summary.
Camera pose estimation
Correspondence-and-solver pipelines.
Pose regression: absolute and relative.
Summary.
...and 36 more sections

Figures (8)

Figure 1: Classification of video understanding by level. This survey organizes video understanding methods into three categories: (1) low-level geometry understanding, (2) high-level semantic understanding, and (3) unified video understanding. The right diagram illustrates the conceptual relationship among the three levels.
Figure 2: Comparison of low-level video geometry understanding tasks: video depth estimation (left), camera pose estimation (middle), and optical flow/point tracking (right). While all three tasks take video frames as input, they recover different geometric quantities, namely scene depth, camera motion, and temporal correspondences across frames.
Figure 3: Comparison of joint feed-forward geometry models. (a) Pairwise processing methods [weinzaepfel2022croco, wang2024dust3r] predict view-consistent representations (e.g., point maps) from image pairs, typically operating on independent frame pairs and relying on later aggregation or optimization to obtain multi-view geometry. (b) Multi-view-input, multi-output methods [wang2025vggt, cut3r] jointly reason over a set of frames in a single forward pass, using global attention or memory to directly predict multiple geometric primitives with cross-view consistency.
Figure 4: Overview of video segmentation. We categorize video segmentation methods into three paradigms based on how semantic categories are specified. (a) Video class-aware segmentation assumes a fixed, predefined label set and includes video semantic, instance, and panoptic segmentation. (b) Video open-vocabulary segmentation extends class-aware settings by leveraging language embeddings to segment both seen and unseen categories. (c) Video class-agnostic segmentation removes semantic labels altogether, instead segmenting and tracking objects using visual or concept-level prompts such as clicks, boxes, or masks.
Figure 5: Overview of video object tracking. (a) Siamese tracking performs target localization by matching a template extracted from the target object against a search region in the current frame. (b) Sequence-level tracking extends this formulation by modeling temporal dependencies across multiple consecutive frames, thereby improving tracking continuity and robustness. (c) Multimodal tracking further incorporates complementary modalities, such as visual, textual, or auxiliary sensory cues, to enhance target representation and achieve more reliable tracking in challenging scenarios.
...and 3 more figures
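
Figure 5(a) describes Siamese tracking as matching a target template against a search region in the current frame. The following is a minimal sketch of that matching step only, under a loud assumption: raw grayscale pixels stand in for the learned feature embeddings that an actual Siamese tracker would correlate. The function names `match_template` and `locate` are hypothetical, and normalized cross-correlation is used here for illustration rather than taken from any method covered in the survey.

```python
import numpy as np


def match_template(search: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Return a score map of normalized cross-correlation values.

    Assumption: inputs are 2D arrays (e.g., grayscale patches); a real
    Siamese tracker would correlate learned feature maps instead.
    """
    th, tw = template.shape
    sh, sw = search.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    scores = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = search[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            # Flat patches have zero norm; score them as no match.
            scores[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return scores


def locate(search: np.ndarray, template: np.ndarray) -> tuple[int, int]:
    """Return the (row, col) of the top-left corner of the best match."""
    scores = match_template(search, template)
    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    return int(y), int(x)
```

In a tracking loop, the peak of the score map would give the target's position in the current frame. The sequence-level and multimodal variants in Figure 5(b) and 5(c) go beyond this single-frame matching step and are not covered by the sketch.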
