It's a Matter of Time: Three Lessons on Long-Term Motion for Perception

Willem Davison, Xinyue Hao, Laura Sevilla-Lara

TL;DR

The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone.

Abstract

Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.
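
To make the dimensionality argument in lesson 3 concrete, the following is a rough back-of-the-envelope sketch, not a figure from the paper. The clip length, frame resolution, and number of tracked points are all hypothetical values chosen for illustration, and the two helper functions are introduced here only to count raw input values.

```python
def video_input_size(frames, height, width, channels=3):
    """Number of raw values in an RGB video clip."""
    return frames * height * width * channels


def track_input_size(frames, num_points, coords=2):
    """Number of raw values in a set of 2D point tracks."""
    return frames * num_points * coords


# Hypothetical sizes (assumptions, not taken from the paper).
FRAMES, HEIGHT, WIDTH, NUM_POINTS = 16, 224, 224, 256

video_values = video_input_size(FRAMES, HEIGHT, WIDTH)
track_values = track_input_size(FRAMES, NUM_POINTS)
ratio = video_values / track_values

print(f"video: {video_values} values, tracks: {track_values} values")
print(f"the video input is {ratio:.0f}x larger")
```

Under these assumed sizes the point-track input is a few hundred times smaller than the pixel input, which is the kind of gap that could make a motion model cheaper in GFLOPs. The actual savings depend on the tracker cost and the model design, neither of which this sketch accounts for.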

Paper Structure (36 sections, 1 equation, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Visualizations of dense point-tracked video instances where color represents temporal evolution (from blue to red). Long-term motion can reveal information about the structure of objects and the actions they are performing. The trajectories alone clearly capture the distinct spatio-temporal dynamics of each action, providing a rich, compact, and effective representation of motion.
  • Figure 2: MovT architecture. Point tracks are factorized into motion and positional components for embedding. The resulting embeddings are fused for the final spatial transformer layer.
  • Figure 3: Lesson 1: Motion representations can solve a variety of tasks with high accuracy, comparable to or better than video representations. We display the task-relevant performance metric in both the point-track and pixel input studies. Up and down arrows $(\uparrow/\downarrow)$ signify whether a higher or lower performance metric is better. The red line on RAVDESS MovT at 31% marks the performance of our baseline experiment, where the facial landmark points are input to MovT for training.
  • Figure 4: Lesson 1: Many classes can be solved with motion information alone, including examples where image representations perform poorly. Scatter plots displaying a comparison of the per-class accuracy of our MovT and PixT models in three classification tasks.
  • Figure 5: Performance impact of temporally cropping SSV2 videos.
  • ...and 8 more figures
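
The Figure 2 caption says point tracks are factorized into motion and positional components. One plausible reading, sketched below, is to take each track's starting location as its positional component and its frame-to-frame displacements as its motion component. This is an assumption made for illustration: the caption does not specify the factorization, and `factorize_tracks` is a hypothetical helper, not part of MovT.

```python
def factorize_tracks(tracks):
    """Split point tracks into positional and motion components.

    tracks: list of N tracks, each a list of T (x, y) tuples.
    Returns (positions, motions), where positions holds each track's
    first (x, y) location and motions holds its T - 1 per-frame
    displacements (dx, dy).

    Assumed factorization for illustration only.
    """
    positions = [track[0] for track in tracks]
    motions = [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]
        for track in tracks
    ]
    return positions, motions


# Two toy tracks over three frames: one drifting right, one moving up.
tracks = [
    [(10.0, 20.0), (12.0, 20.0), (15.0, 21.0)],
    [(50.0, 50.0), (50.0, 48.0), (50.0, 45.0)],
]
positions, motions = factorize_tracks(tracks)
print(positions)
print(motions)
```

Separating the two components this way would let a model embed where a point starts independently of how it moves, so the same displacement pattern yields the same motion component anywhere in the frame.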