Unsupervised Learning from Video with Deep Neural Embeddings

Chengxu Zhuang, Tianwei She, Alex Andonian, Max Sobol Mark, Daniel Yamins

TL;DR

The paper tackles unsupervised visual learning from video by proposing Video Instance Embedding (VIE), which embeds videos into a latent space on the unit sphere and optimizes relationships across videos using Instance Recognition (IR) and Local Aggregation (LA) style losses computed against a memory bank. It evaluates static, dynamic, and two-pathway architectures, showing that multi-stream models transfer strongly to action recognition on Kinetics and object recognition on ImageNet, with two-pathway variants often performing best. The results indicate that dynamic processing excels on video tasks while static features better support transfer to static images, and they underscore the value of long-range temporal context and larger datasets for improving unsupervised video representations. The work positions deep video embeddings as a promising direction for unsupervised learning across domains and datasets, and suggests future refinements with alternative losses and architectures.
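To make the unit-sphere embedding and memory bank concrete, here is a minimal sketch in plain Python. The class name `MemoryBank`, the `momentum` mixing weight, and the choice to re-normalize after each update are illustrative assumptions and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm, i.e. project it onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class MemoryBank:
    """Keeps one running-mean embedding per video (hypothetical helper)."""

    def __init__(self, momentum=0.5):
        # Assumed mixing weight between the stored and the new embedding.
        self.momentum = momentum
        self.bank = {}

    def update(self, video_id, embedding):
        new = l2_normalize(embedding)
        old = self.bank.get(video_id)
        if old is None:
            # First sample seen for this video: store it directly.
            self.bank[video_id] = new
        else:
            m = self.momentum
            mixed = [m * o + (1.0 - m) * n for o, n in zip(old, new)]
            # Assumption: re-normalize so stored entries stay on the sphere.
            self.bank[video_id] = l2_normalize(mixed)
        return self.bank[video_id]


bank = MemoryBank(momentum=0.5)
bank.update("v1", [3.0, 4.0])   # stored as [0.6, 0.8]
bank.update("v1", [4.0, 3.0])   # mixed with the stored entry, then re-normalized
```

In a real training loop the embeddings would come from the network pathways, and the loss would compare each new embedding against the entries stored in the bank.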

Abstract

Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for training visual representations in deep neural networks. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which extends powerful recent unsupervised loss functions for learning deep nonlinear embeddings to multi-stream temporal processing architectures on large-scale video datasets. We show that VIE-trained networks substantially advance the state of the art in unsupervised learning from video datastreams, both for action recognition in the Kinetics dataset, and object recognition in the ImageNet dataset. We show that a hybrid model with both static and dynamic processing pathways is optimal for both transfer tasks, and provide analyses indicating how the pathways differ. Taken in context, our results suggest that deep neural embeddings are a promising approach to unsupervised visual learning across a wide variety of domains.

Paper Structure

This paper contains 6 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Schematic of the Video Instance Embedding (VIE) Framework. (a) Frames from individual videos ($\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$) are (b) sampled into sequences of varying lengths and temporal densities, and input into (c) deep neural network pathways that are either static (single image) or dynamic (multi-image). (d) Outputs of frame samples from either pathway are vectors in the $D$-dimensional unit sphere $S^D \subset \mathbf{R}^{D+1}$. The running mean of the embedding vectors is calculated over online samples for each video, (e) stored in a memory bank, and (f) compared at each time step via unsupervised loss functions in the video embedding space. The loss functions require computing distributional properties of the embedding vectors. For example, the Local Aggregation (LA) loss function involves identifying Close Neighbors $\mathbf{C}_i$ (light brown points) and Background Neighbors $\mathbf{B}_i$ (dark brown points), which are used to determine how to move the target point (green) relative to other points (red/blue).
  • Figure 2: Video retrieval results for the VIE-Single and VIE-Slowfast models on the Kinetics validation set. GT = ground truth label, Pred = model prediction. For each query video, the three nearest training neighbors are shown. Red font indicates an error.
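The retrieval shown in Figure 2 can be illustrated with a small nearest-neighbor lookup in the embedding space. This is a sketch only: it assumes unit-length embeddings ranked by dot product (equivalent to cosine similarity on the sphere), and the function name `nearest_neighbors` and the toy two-dimensional vectors are made up for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbors(query, bank, k=3):
    """Return the ids of the k stored embeddings most similar to the query."""
    ranked = sorted(bank.items(), key=lambda item: dot(query, item[1]), reverse=True)
    return [video_id for video_id, _ in ranked[:k]]


# Toy "training set" of unit-length embeddings (hypothetical values).
train_bank = {
    "train_a": [1.0, 0.0],
    "train_b": [0.0, 1.0],
    "train_c": [0.6, 0.8],
    "train_d": [-1.0, 0.0],
}

print(nearest_neighbors([0.8, 0.6], train_bank, k=3))
# -> ['train_c', 'train_a', 'train_b']
```

With real embeddings, the labels of the retrieved training videos could then be inspected alongside the query's ground truth label, as in the figure.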