Unfolding Videos Dynamics via Taylor Expansion

Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, Zhongming Liu

TL;DR

ViDiDi addresses the challenge of learning motion-aware video representations under self-supervision by treating videos as continuous processes and unfolding them with a Taylor expansion across 0th, 1st, and 2nd order derivatives. A balanced alternating learning schedule aligns representations across these derivative views within a single encoder, and ViDiDi is designed to plug into existing instance-discrimination SSL frameworks such as SimCLR, BYOL, and VICReg. Empirically, ViDiDi yields significant improvements in video retrieval, action recognition, and action detection, with strong data efficiency and cross-backbone generalization. The approach provides a physics-inspired, principled means to emphasize dynamics over static content and suggests avenues for extending Taylor-based dynamics to other modalities and larger-scale architectures.

Abstract

Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. Performance improves by a significant margin without the need for large models or extensive datasets.
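
The temporal derivatives of a frame sequence mentioned above can be illustrated with finite differences. The following is a minimal sketch, not the authors' implementation: the function name `temporal_derivatives`, the forward-difference scheme, the time step `dt`, and the toy clip are all assumptions made for illustration, since the abstract does not specify the discretization.

```python
import numpy as np

def temporal_derivatives(clip, dt=1.0):
    """Return the 0th, 1st, and 2nd order temporal derivatives of a clip.

    clip: array of shape (T, H, W, C) with T >= 3 frames.
    Forward finite differences are an assumption of this sketch.
    """
    clip = np.asarray(clip, dtype=np.float64)
    d0 = clip                                   # original frames
    d1 = np.diff(clip, n=1, axis=0) / dt        # frame-to-frame change
    d2 = np.diff(clip, n=2, axis=0) / dt ** 2   # change of the change
    return d0, d1, d2

# Toy clip: a static random background plus a brightness term that
# grows as 0.5 * a * t^2 (hypothetical data, chosen so the result is known).
T, H, W, C = 8, 4, 4, 3
a = 2.0
t = np.arange(T, dtype=np.float64)
background = np.random.default_rng(0).uniform(size=(H, W, C))
clip = background[None] + (0.5 * a * t ** 2)[:, None, None, None]

d0, d1, d2 = temporal_derivatives(clip)
print(d0.shape, d1.shape, d2.shape)
```

In this toy case the static background cancels in the first difference and the second difference equals the constant `a` everywhere, which mirrors the claim that derivative views emphasize motion over static content. Each difference shortens the clip by one frame, so a real pipeline would need to decide how to align clip lengths across orders.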

Paper Structure

This paper contains 43 sections, 9 equations, 15 figures, 7 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Method Overview. (a) The ViDiDi method evaluates temporal derivatives of video clips through Taylor expansion, uses the same encoder to embed them into the latent space, and pulls together their representations for the same video while pushing apart those from different videos. (b) Pretraining via ViDiDi significantly improves existing instance discrimination methods on action recognition.
  • Figure 2: A thought experiment on physical motion. The Taylor series expansion projects the dynamic process of a free-fall motion onto three views expressed in terms of the height, velocity, and acceleration. The reverse inference of the common causes of the height, velocity, and acceleration leads to the encoding of the gravity $\bm g$, the only variable pertaining to all three views, rather than the unrelated static latents $\bm {y_0}$ and $\bm {v_0}$.
  • Figure 3: Illustration of the ViDiDi framework. For a batch of videos $\bm I$, we apply two spatio-temporal augmentations $\bm \tau$ and $\bm \tau^{\prime}$ to obtain two batches of clips: $\bm V$ and $\bm V'$. These clips are evaluated for the $0^{th}$, $1^{st}$, or $2^{nd}$ order temporal derivatives. Such derivatives are further selected (denoted as $\bm X$ and $\bm X'$) via a balanced alternating learning strategy described in Algorithm 1. $\bm X$ and $\bm X'$ are the inputs to the video encoder in a 2-stream SSL framework such as SimCLR, BYOL, and VICReg for learning through instance discrimination. $\bm f$ is the video encoder, $\bm h$ is the MLP projection head, and $\bm Z$ and $\bm Z'$ are the encoded embeddings.
  • Figure 4: Silhouette scores and t-SNE of top 5 classes from VICReg (left) and ViDiDi-VIC (right).
  • Figure 5: Spatiotemporal attention on UCF and HMDB51. Left: Original frames. Middle: Attention from VIC. Right: Attention from ViDiDi-VIC.
  • ...and 10 more figures
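
The free-fall thought experiment of Figure 2 can be written out explicitly. The equations below are a sketch under stated assumptions rather than a reproduction of the paper's own equations: constant gravity $\bm g$ treated as a signed vector, initial height $\bm {y_0}$, initial velocity $\bm {v_0}$, and no air resistance.

```latex
\begin{align}
\bm y(t) &= \bm{y_0} + \bm{v_0}\, t + \tfrac{1}{2}\, \bm g\, t^2 && \text{(height, 0th order)} \\
\dot{\bm y}(t) &= \bm{v_0} + \bm g\, t && \text{(velocity, 1st order)} \\
\ddot{\bm y}(t) &= \bm g && \text{(acceleration, 2nd order)} \\
\bm y(t + \Delta t) &= \bm y(t) + \dot{\bm y}(t)\, \Delta t + \tfrac{1}{2}\, \ddot{\bm y}(t)\, \Delta t^2
\end{align}
```

Under these assumptions the Taylor expansion in the last line is exact, because all derivatives beyond the second vanish. $\bm {y_0}$ appears only in the height, $\bm {v_0}$ in the height and the velocity, and $\bm g$ in all three views, which is why an encoder asked to agree across the three views is pushed toward $\bm g$.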