Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
TL;DR
This work investigates whether video content can be inferred from camera trajectories alone, introducing CamFormer, a dedicated trajectory encoder trained with contrastive learning to align pose trajectories with text descriptions. A contextualized encoding strategy extends temporal context to disambiguate local actions, enabling robust cross-domain analysis in both egocentric and exocentric settings. Across ten downstream tasks and five datasets, CamFormer delivers consistent gains and can surpass some vision-based baselines when used alone, while providing complementary benefits when fused with vision. The approach demonstrates robustness to diverse pose estimators and highlights camera trajectory as a lightweight, privacy-preserving modality for semantic video perception with practical implications for retrieval, classification, and temporal analysis.
Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. To this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensor and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
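To make the idea of aligning trajectories with language concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective of the kind commonly used for joint embedding spaces. It is not taken from the paper: the function names, the temperature value of 0.07, and the use of plain NumPy arrays in place of real encoder outputs are all assumptions made for illustration, and the actual CamFormer training objective may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(traj_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    traj_emb: (batch, dim) trajectory embeddings (stand-in for encoder output).
    text_emb: (batch, dim) text embeddings; row i is the caption paired with
              trajectory i. The temperature is an assumed value.
    """
    traj = l2_normalize(traj_emb)
    text = l2_normalize(text_emb)

    # Pairwise cosine similarities, sharpened by the temperature.
    logits = traj @ text.T / temperature
    idx = np.arange(logits.shape[0])

    # Trajectory-to-text: each row should assign most mass to its own caption.
    loss_traj_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-trajectory: each column should assign most mass to its own trajectory.
    loss_text_to_traj = -log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_traj_to_text + loss_text_to_traj)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Correctly paired embeddings should score a lower loss than shifted pairs.
    print(symmetric_contrastive_loss(emb, emb))
    print(symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, matched trajectory and text pairs sit on the diagonal of the similarity matrix, so minimizing the loss pulls each trajectory toward its own description and pushes it away from the other descriptions in the batch.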
