Video alignment using unsupervised learning of local and global features

Niloufar Fakhfour, Mohammad ShahverdiKondori, Sajjad Hashembeiki, Mohammadjavad Norouzi, Hoda Mohammadzade

TL;DR

The paper presents an unsupervised, training-free video alignment method that combines global and local features of the frames. The reported results show that it outperforms previous state-of-the-art methods, such as TCC, as well as other self-supervised and weakly supervised methods.

Abstract

In this paper, we tackle the problem of video alignment, the process of matching the frames of a pair of videos containing similar actions. The main challenge in video alignment is that accurate correspondence must be established despite differences in execution and appearance between the two videos. We introduce an unsupervised alignment method that uses global and local features of the frames. In particular, we compute effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network. The features are then processed and combined to construct a multidimensional time series that represents the video. The resulting time series are used to align videos of the same action using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW). The main advantage of our approach is that no training is required, which makes it applicable to any new type of action without the need to collect training samples for it. Additionally, our approach can be used for framewise labeling of action phases in a dataset with only a few labeled videos. For evaluation, we considered video synchronization and phase classification tasks on the Penn Action dataset and a subset of the UCF101 dataset. To evaluate the video synchronization task effectively, we also present a new metric called Enclosed Area Error (EAE). The results show that our method outperforms previous state-of-the-art methods, such as TCC, as well as other self-supervised and weakly supervised methods.
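
As a rough illustration of aligning two multidimensional time series with a diagonal-aware dynamic time warping, the sketch below adds a penalty to cells that lie outside a margin around the diagonal. It is not the authors' DDTW: the exact formulation is not given in this excerpt, so the linear penalty, the normalised offset, and the names `frame_distance`, `diagonal_penalty`, `penalized_dtw`, `margin`, and `weight` are all assumptions made for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two per-frame feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diagonal_penalty(i, j, n, m, margin, weight):
    """Assumed penalty: zero inside the margin around the diagonal,
    growing linearly with the distance beyond the margin."""
    # Offset from the diagonal after normalising both time axes to [0, 1].
    offset = abs(i / max(n - 1, 1) - j / max(m - 1, 1))
    return weight * max(0.0, offset - margin)


def penalized_dtw(series_a, series_b, margin=0.1, weight=1.0):
    """Return (total cost, alignment path) for two feature sequences."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]

    # Fill the accumulated-cost table with the usual DTW recurrence,
    # using the penalised local cost for every cell.
    for i in range(n):
        for j in range(m):
            cost = frame_distance(series_a[i], series_b[j])
            cost += diagonal_penalty(i, j, n, m, margin, weight)
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                )
            acc[i][j] = cost + best

    # Backtrack from the last cell to recover the frame correspondence.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

Setting `weight=0.0` reduces the sketch to plain dynamic time warping, which makes it easy to compare alignment paths with and without the diagonal penalty.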

Paper Structure (27 sections, 9 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: We propose an unsupervised method to align pairs of videos that present the same actions. We model a video as a time series which consists of global and local features extracted from each frame. In addition, we introduce a novel DTW, called Diagonalized Dynamic Time Warping (DDTW), to find corresponding frames in each pair of videos.
  • Figure 2: In our method, two types of features are used to build the time series: local features (pose and box features) and global features. From the extracted pose and box, static and dynamic features are calculated for each image. To calculate the global features, we multiply the pixels of each frame by a Gaussian weight determined by the extracted box, feed the weighted frame to the VGG network, and extract the global features from its output.
  • Figure 3: A VGG network is used to calculate the global features. The input of the network is a frame weighted by a truncated 2D Gaussian. To adapt the network, the last three fully connected layers are replaced with a 2D max-pooling layer of stride $(1, 1)$ and filter size $(7, 7)$, followed by a flatten layer and then a 1D max-pooling layer with a stride of $1$ and a size of $8$.
  • Figure 4: The DDTW method. The green lines parallel to the diagonal mark the margin. The blue path shows the alignment of frames; leaving the margin incurs a penalty that is calculated according to the distance from the diagonal.
  • Figure 5: The enclosed area for our predicted path and for the trivial path, shown for three pairs of videos. For the trivial method, the alignment path is the straight line passing through the lower-left and upper-right corners of the table.
  • ...and 1 more figures
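
The Gaussian weighting of a frame described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not say how the Gaussian is parameterised or truncated, so centring it on the box, tying its spread to the box size through `sigma_scale`, and zeroing weights beyond `cutoff` standard deviations are all hypothetical choices, as are the function names.

```python
import math


def truncated_gaussian_weights(height, width, box, sigma_scale=0.5, cutoff=3.0):
    """Build a height x width weight map centred on a bounding box.

    box is (x_min, y_min, x_max, y_max) in pixel coordinates.
    sigma_scale and cutoff are assumed parameters, not from the paper.
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    # Spread proportional to the box size; guard against a degenerate box.
    sx = max(sigma_scale * (x_max - x_min), 1e-6)
    sy = max(sigma_scale * (y_max - y_min), 1e-6)

    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = (x - cx) / sx, (y - cy) / sy
            d2 = dx * dx + dy * dy
            # Truncation: pixels farther than `cutoff` sigmas get zero weight.
            row.append(math.exp(-0.5 * d2) if d2 <= cutoff ** 2 else 0.0)
        weights.append(row)
    return weights


def apply_weights(frame, weights):
    """Multiply every channel of every pixel by its weight.

    frame is a nested list of shape height x width x channels.
    """
    return [
        [[channel * w for channel in pixel] for pixel, w in zip(frame_row, weight_row)]
        for frame_row, weight_row in zip(frame, weights)
    ]
```

In this sketch the weighted frame returned by `apply_weights` plays the role of the network input; pixels near the detected person keep most of their intensity, while the background is attenuated or zeroed.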