Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han

TL;DR

This work investigates whether video content can be inferred from camera trajectories alone, introducing CamFormer, a dedicated trajectory encoder trained with contrastive learning to align pose trajectories with text descriptions. A contextualized encoding strategy extends temporal context to disambiguate local actions, enabling robust cross-domain analysis in both egocentric and exocentric settings. Across ten downstream tasks and five datasets, CamFormer delivers consistent gains and can surpass some vision-based baselines when used alone, while providing complementary benefits when fused with vision. The approach demonstrates robustness to diverse pose estimators and highlights camera trajectory as a lightweight, privacy-preserving modality for semantic video perception with practical implications for retrieval, classification, and temporal analysis.

Abstract

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
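
To make the idea of aligning trajectories with language concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired trajectory and text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the exact loss used to train CamFormer is not given in this excerpt, and the function names, the symmetric form, and the temperature of 0.07 are all hypothetical choices.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(traj_emb, text_emb, temperature=0.07):
    """Assumed symmetric contrastive loss; pair i is (traj_emb[i], text_emb[i])."""
    traj = [l2_normalize(v) for v in traj_emb]
    text = [l2_normalize(v) for v in text_emb]
    # logits[i][j]: cosine similarity of trajectory i and text j, over temperature
    logits = [[sum(a * b for a, b in zip(t, s)) / temperature for s in text]
              for t in traj]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-trajectory direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings (hypothetical): matched pairs versus deliberately swapped pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each trajectory embedding matches its own text embedding and large when the pairs are swapped, which is the behaviour a contrastive objective of this kind relies on to pull matched (trajectory, text) pairs together in the joint space.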

Paper Structure

This paper contains 42 sections, 1 equation, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Can you guess which action goes with which camera pose trajectory? In this paper, we find that camera trajectory carries rich information about the video's content, in both egocentric and exocentric settings. Answers are given on the next page.
  • Figure 2: Unlocking Semantic Information Hidden in Camera Trajectories. (a) We propose contrastive pre-training on paired (trajectory, text) data. Our model, CamFormer, is trained to map camera trajectories into a joint semantic space, aligning them with natural language. (b) We propose contextualized trajectory encoding that incorporates extended temporal context to disambiguate the local action.
  • Figure 3: (a) Quantitative Results Overview: we summarize CamFormer's performance against base methods / models on 10 downstream tasks across 5 datasets, demonstrating its consistent performance advantages; (b) A PCA visualization of CamFormer embeddings on unseen Ego-Exo4D trajectories, colored by the dataset's 8 activity labels. (Note: CamFormer only takes the trajectory as input; video clips and text are shown for interpretation only); (c) Per-class activity classification accuracy plot on Ego-Exo4D reveals a performance dichotomy: CamFormer excels at physical activities but is less effective on procedural ones with more subtle camera motions.
  • Figure 4: Qualitative Text Retrieval Results on egocentric Ego-Exo4D (top) and exocentric DynPose-100K (bottom). Top: A clear downward pose trajectory disambiguates the action of landing, where visual cues are subtle. Bottom: A circling trajectory, common for capturing a scene overview, is correctly associated with the high-level scene description. See Supp. for more qualitative results.
  • Figure 5: Scene Attribute Classification Results on DynPose-100K. The results reveal a clear spectrum of what can and cannot be inferred from an observer's camera trajectory.
  • ...and 6 more figures