SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady

TL;DR

The paper tackles motion-centric video similarity, arguing that current representations rely too heavily on static appearance. It presents SemanticMoments, a training-free method that encodes semantic motion by applying temporal statistics to patch-level features from pretrained backbones, formalized in the M+ framework with $M^{(k)}$ descriptors and moment embeddings $\phi_{video}$. The authors introduce the SimMotion-Synthetic and SimMotion-Real benchmarks to evaluate motion alignment, and show that higher-order temporal moments improve motion clustering and retrieval across multiple backbones, outperforming RGB-, flow-, and text-supervised baselines. The approach offers a scalable, perceptually grounded foundation for motion-aware video understanding without extra training, highlighting both practical gains and remaining gaps in real-world motion perception.

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.

Paper Structure (21 sections, 4 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Motion-centric retrieval with SemanticMoments. Existing video-similarity methods over-rely on static appearance and scene context, overlooking temporal dynamics. Our approach retrieves clips that match the semantic motion. We retrieve the drinking-coffee motion across identities, disentangling motion from appearance, whereas the baselines all return look-alikes and miss the action.
  • Figure 2: Current benchmarks are appearance-centric. We show random frames from popular video-retrieval datasets. In many cases, static objects (e.g., a cello, a razor) or scene context (e.g., a basketball court) suffice to identify the action label (e.g., Playing Cello, Shaving Beard) without observing motion. This bias enables high accuracy from purely appearance-based cues, discouraging models from learning true temporal dynamics.
  • Figure 3: Controlled variation in SimMotion-Synthetic. We visualize sample pairs from the five distinct categories in our benchmark. From left to right: Static Object (background varies), Dynamic Appearance (subject clothing/attributes vary), Dynamic Object (subject identity varies), View (camera angle varies), and Scene Style (rendering style varies). In each column, the top and bottom videos are temporally synchronized and share identical motion dynamics, differing only in the specified visual factor.
  • Figure 4: Motion-focused similarity with moment statistics. (a) Appearance-altered edits preserve the same underlying motion for each motion group $m_{i}$, while changing visual style. (b) Baseline embeddings yield similarity heatmaps that are sensitive to appearance rather than motion. (c) Our moment-based embedding (using the first three moments over patch features) produces clearer motion-consistent clusters (corresponding to shared motion $m_{i}$) than global mean pooling. Brighter cells indicate higher cosine similarity.
  • Figure 5: SemanticMoments pipeline. Patch-wise features are extracted per frame using a pretrained embedder (e.g., DINO) and summarized over time using the first three temporal moments (mean, variance, and skewness). Spatial aggregation yields one descriptor per moment, which are combined into a global motion-centric video embedding.
  • ...and 1 more figures
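
The pipeline described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the patch features have already been extracted by a pretrained embedder and are given as an array of shape (frames, patches, channels). The function names `moment_embedding` and `cosine_similarity`, the mean pooling over patches, the per-moment L2 normalization, and the concatenation step are all assumptions made for the example; the source only states that the first three temporal moments are computed per patch, spatially aggregated, and combined.

```python
import numpy as np


def moment_embedding(feats: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical sketch of a moment-based video embedding.

    feats: array of shape (T, P, D) with T frames, P patches, D channels.
    Returns a vector of shape (3 * D,).
    """
    # First temporal moment: per-patch mean over frames.
    mean = feats.mean(axis=0)                      # (P, D)
    centered = feats - mean                        # (T, P, D)

    # Second temporal moment: per-patch variance over frames.
    var = (centered ** 2).mean(axis=0)             # (P, D)

    # Third temporal moment: per-patch skewness (standardized third moment).
    std = np.sqrt(var + eps)
    skew = ((centered / std) ** 3).mean(axis=0)    # (P, D)

    # Spatial aggregation (assumed: mean over patches), one descriptor per moment.
    descriptors = [m.mean(axis=0) for m in (mean, var, skew)]

    # Assumed: L2-normalize each descriptor so no single moment dominates.
    descriptors = [d / (np.linalg.norm(d) + eps) for d in descriptors]

    # Assumed: combine descriptors by concatenation.
    return np.concatenate(descriptors)             # (3 * D,)


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


# Usage with random stand-in features (real features would come from a backbone).
rng = np.random.default_rng(0)
video_a = rng.normal(size=(8, 16, 32))   # 8 frames, 16 patches, 32 channels
video_b = rng.normal(size=(8, 16, 32))
similarity = cosine_similarity(moment_embedding(video_a), moment_embedding(video_b))
```

Computing such similarities for every pair of videos would give a matrix of the kind visualized as heatmaps in Figure 4.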