Learning Situated Awareness in the Real World

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

TL;DR

SAW-Bench is a benchmark for situated spatial intelligence that moves beyond passive observation to understanding physically grounded, observer-centric dynamics. It reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.

Abstract

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

Paper Structure (70 sections, 12 figures, 11 tables)

Figures (12)

  • Figure 1: (Left) Situated Awareness in the Real World. A real-world example in which the observer walks along a straight trajectory while frequently rotating their head. The resulting egocentric video exhibits substantial camera orientation changes despite linear translational motion. (Right) Reasoning Task Performance. Radar plot compares human performance with representative MFMs across six situated awareness tasks in SAW-Bench.
  • Figure 2: Overview of SAW-Bench. We illustrate six representative tasks (§\ref{subsec:tasks}) evaluating different aspects of situated awareness: Self-Localization, Relative Direction, Route Shape, Reverse Route Plan, Spatial Memory, and Spatial Affordance. During data collection, human annotators follow pre-defined trajectories when recording egocentric videos (§\ref{subsec:collection}); these trajectories are visualized as purple dashed arrows. For all tasks, the model input consists solely of egocentric video without access to any bird’s-eye or global scene representations; the visualizations shown here are provided for illustrative purposes only.
  • Figure 3: Benchmark Curation Pipeline. We first pre-define 37 camera trajectories and annotate their metadata (details are provided in §\ref{app:annotation_details}). Human video collectors then record egocentric videos by following these trajectories in selected scenes. Low-quality recordings are filtered and re-captured to ensure consistent video quality.
  • Figure 4: Error Case Analysis. (Left) Reverse Route Plan: Gemini 3 Flash successfully reconstructs the return path by systematically inverting the actions from the forward pass. In contrast, Qwen3-VL 235B attempts to exploit a shortcut between the first and last frames, thereby neglecting the transitive dynamics and spatial transformations occurring throughout the frame sequence. (Right) Route Shape: While both Gemini 3 Flash and Qwen3-VL 235B effectively identify camera rotations, they incorrectly integrate these rotational pans into the observer's physical movement trajectory, leading to an incorrect understanding of the route shape.
  • Figure 5: Camera Rotation and Observer's Trajectory. Visualization of three controlled scenarios used to isolate the impact of head rotation on Route Shape. (Left) a straight path with steady head orientation; (Middle) the same straight path with frequent left-and-right head rotations; and (Right) a true zigzag trajectory.
  • ...and 7 more figures
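
The distinction drawn in Figures 4 and 5, between camera rotation and the observer's actual trajectory, can be illustrated with a toy 2D sketch. This is not the paper's evaluation code: the function names, step lengths, and yaw angles below are hypothetical and chosen only for illustration. The sketch assumes the observer walks straight along the x-axis while panning the head left and right, and compares the true path with the path obtained if head yaw is mistaken for travel direction.

```python
import math

def integrate_path(step_lengths, travel_headings_deg):
    """Accumulate 2D positions from per-step lengths and travel directions (degrees)."""
    x, y = 0.0, 0.0
    points = [(x, y)]
    for length, heading in zip(step_lengths, travel_headings_deg):
        rad = math.radians(heading)
        x += length * math.cos(rad)
        y += length * math.sin(rad)
        points.append((x, y))
    return points

def max_lateral_deviation(points):
    """Largest |y| offset from the x-axis, i.e. from the intended straight line."""
    return max(abs(y) for _, y in points)

# Hypothetical scenario: six unit steps straight ahead, head panning +/-30 degrees.
steps = [1.0] * 6
body_heading = [0.0] * 6
head_yaw = [30.0, -30.0, 30.0, -30.0, 30.0, -30.0]

# Correct reading: translation follows the body heading; head yaw only rotates the view.
true_path = integrate_path(steps, body_heading)

# Incorrect reading: head yaw is folded into the travel direction, producing a zigzag.
confused_path = integrate_path(
    steps, [b + h for b, h in zip(body_heading, head_yaw)]
)

print("true path lateral deviation:", max_lateral_deviation(true_path))
print("confused path lateral deviation:", max_lateral_deviation(confused_path))
```

Under these assumptions the true path stays on the straight line, while the confused path oscillates sideways, mirroring the straight-versus-zigzag contrast described for the Route Shape task.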