SemanticMoments: Training-Free Motion Similarity via Third Moment Features
Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
TL;DR
The paper tackles motion-centric video similarity, arguing that current video representations rely too heavily on static appearance. It presents SemanticMoments, a training-free method that encodes semantic motion by computing temporal statistics over patch-level features from pretrained backbones, formalized in the M+ framework with $M^{(k)}$ descriptors and moment embeddings $\phi_{video}$. The authors introduce the SimMotion-Synthetic and SimMotion-Real benchmarks to evaluate motion alignment, and show that higher-order temporal moments improve motion clustering and retrieval across multiple backbones, outperforming RGB-, flow-, and text-supervised baselines. The approach offers a scalable, perceptually grounded foundation for motion-aware video understanding without extra training, though gaps in real-world motion perception remain.
Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
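The idea of temporal moments over semantic features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`central_moment`, `moment_embedding`), the choice of central moments of orders 1 to 3, the averaging over patches, and the toy feature values are all assumptions made for illustration. In practice the per-frame patch features would come from a pretrained backbone rather than hand-written lists.

```python
from typing import List, Sequence

# Assumed layout: video[t][p][d] = feature dim d of patch p in frame t.
Video = List[List[List[float]]]


def central_moment(values: Sequence[float], order: int) -> float:
    """Return the mean for order 1, else the order-th central moment."""
    mean = sum(values) / len(values)
    if order == 1:
        return mean
    return sum((v - mean) ** order for v in values) / len(values)


def moment_embedding(video: Video, orders: Sequence[int] = (1, 2, 3)) -> List[float]:
    """Concatenate patch-averaged temporal moments, one block per order.

    Hypothetical aggregation: moments are taken over time for each
    (patch, dim) pair, then averaged over patches.
    """
    num_frames = len(video)
    num_patches = len(video[0])
    dim = len(video[0][0])
    embedding: List[float] = []
    for order in orders:
        for d in range(dim):
            per_patch = [
                central_moment([video[t][p][d] for t in range(num_frames)], order)
                for p in range(num_patches)
            ]
            embedding.append(sum(per_patch) / num_patches)
    return embedding


# Toy clips: 4 frames, 1 patch, 2 feature dims (made-up values).
# Both clips share the same temporal mean, so a first-moment
# (appearance-like) descriptor cannot tell them apart.
static_clip = [[[1.0, 2.0]] for _ in range(4)]
moving_clip = [[[0.0, 2.0]], [[0.0, 2.0]], [[0.0, 2.0]], [[4.0, 2.0]]]

static_emb = moment_embedding(static_clip)
moving_emb = moment_embedding(moving_clip)
print(static_emb)
print(moving_emb)
```

In this toy setup the first-order block of the two embeddings is identical, while the second- and third-order blocks differ only for the clip whose features change over time. That mirrors the abstract's claim that higher-order temporal statistics carry motion information that a time-averaged feature discards.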
