A polar prediction model for learning to represent visual transformations

Pierre-Étienne H. Fiquet, Eero P. Simoncelli

TL;DR

A self-supervised representation-learning framework that extracts and exploits the regularities of natural videos to compute accurate predictions. It achieves better prediction performance than traditional motion compensation and rivals conventional deep networks, while maintaining interpretability and speed.

Abstract

All organisms make temporal predictions, and their evolutionary fitness level depends on the accuracy of these predictions. In the context of visual perception, the motions of both the observer and objects in the scene structure the dynamics of sensory signals, allowing for partial prediction of future signals based on past ones. Here, we propose a self-supervised representation-learning framework that extracts and exploits the regularities of natural videos to compute accurate predictions. We motivate the polar architecture by appealing to the Fourier shift theorem and its group-theoretic generalization, and we optimize its parameters on next-frame prediction. Through controlled experiments, we demonstrate that this approach can discover the representation of simple transformation groups acting in data. When trained on natural video datasets, our framework achieves better prediction performance than traditional motion compensation and rivals conventional deep networks, while maintaining interpretability and speed. Furthermore, the polar computations can be restructured into components resembling normalized simple and direction-selective complex cell models of primate V1 neurons. Thus, polar prediction offers a principled framework for understanding how the visual system represents sensory inputs in a form that simplifies temporal prediction.
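
The appeal to the Fourier shift theorem can be illustrated with a small numerical sketch. It is not the paper's learned model: it assumes a fixed discrete Fourier basis (NumPy's FFT) in place of learned filters, a 1D signal with circular translation, and hypothetical names (`polar_extrapolate`, `frame`, `shift`). Under those assumptions, translation leaves each coefficient's amplitude unchanged and advances its phase at a constant rate, so extrapolating the phase predicts the next frame, whereas linear extrapolation in signal space does not.

```python
import numpy as np

def polar_extrapolate(z_prev, z_curr, eps=1e-12):
    """Keep the current amplitude and advance the phase by the last observed phase step."""
    step = z_curr * np.conj(z_prev)          # its angle is phase(z_curr) - phase(z_prev)
    step = step / (np.abs(step) + eps)       # normalize to (nearly) unit modulus
    return z_curr * step

N = 64
n = np.arange(N)

def frame(t, shift=3):
    """Two superimposed sinusoids, circularly translated by `shift` samples per time step."""
    pos = (n - shift * t) / N
    return np.sin(2 * np.pi * pos) + 0.5 * np.sin(2 * np.pi * 3 * pos)

x_prev, x_curr, x_next = frame(0), frame(1), frame(2)

# Prediction in polar coordinates of the Fourier coefficients.
z_pred = polar_extrapolate(np.fft.fft(x_prev), np.fft.fft(x_curr))
x_polar = np.fft.ifft(z_pred).real

# Baseline: linear extrapolation directly in signal space.
x_linear = 2 * x_curr - x_prev

err_polar = np.mean((x_polar - x_next) ** 2)
err_linear = np.mean((x_linear - x_next) ** 2)
print(f"polar MSE: {err_polar:.2e}, linear MSE: {err_linear:.2e}")
```

For this synthetic signal the polar prediction error is at the level of floating-point round-off, while the signal-space baseline leaves a clearly nonzero error. Natural videos are not pure translations, which is why the framework learns its filters rather than fixing a Fourier basis.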

Paper Structure

This paper contains 37 sections, 7 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Straightening translations. (a) Three snapshots of a translating signal consisting of two superimposed sinusoidal components: $x_{n,t} = \sin(2\pi(n-t)) + \sin(2\pi3(n-t))/2$. (b) Projection of the signal into the space of the top three principal components. The colored points correspond to the three snapshots in panel (a). In signal space, the temporal trajectory is highly curved, so linear extrapolation fails. (c) Complex-valued Fourier coefficients of the signal as a function of frequency. The temporal trajectory of the frequency representation is the phase advance of each sinusoidal component. (d) Trajectory of one amplitude component and both (unwrapped) phase components. The conversion from rectangular to polar coordinates reduces the trajectory to a straight line, which is predictable via linear extrapolation.
  • Figure 2: Polar prediction model. The previous and current images in a sequence ($\mathbf{x}_{t-1}$ and $\mathbf{x}_{t}$) are convolved with pairs of filters ($\mathbf{W}^*$), each yielding complex-valued coefficients. For a given spatial location in the image, the coefficients for each pair of filters are depicted in complex planes with colors corresponding to time step. The coefficients at time $t+1$ are predicted from those at times $t-1$ and $t$ by extrapolating the phase ($\bm{\delta}_t$). These predicted coefficients are then convolved with the adjoint filters ($\mathbf{W}$) to generate a prediction of the next image in the sequence ($\hat{\mathbf{x}}_{t+1}$). This prediction is compared to the next frame ($\mathbf{x}_{t+1}$) by computing the mean squared error (MSE) and the filters are learned by minimizing this error. Notice that, at coarser scales, the coefficient amplitudes tend to be larger and the phase advance smaller, compared to finer scales.
  • Figure 3: Laplacian pyramid. An image is recursively split into a low-frequency approximation and high-frequency details. Given the initial image $\mathbf{x} = \mathbf{x}_{j=0} \in \mathbb{R}^N$, the low-frequency approximation (a.k.a. Gaussian pyramid coefficients) is computed via blurring (convolution with a fixed filter $B$) and downsampling ("stride" of 2, denoted $2_\downarrow$): $\mathbf{x}_{j} = 2_\downarrow (B \star \mathbf{x}_{j-1}) \in \mathbb{R}^{2^{-j}N}$, for levels $j \in [1, J]$; and the high-frequency details (a.k.a. Laplacian pyramid coefficients) are computed via upsampling (inserting one zero between each sample, denoted $2^\uparrow$) and blurring: $\Delta \mathbf{x}_j = \mathbf{x}_j - B \star (2^\uparrow \mathbf{x}_{j+1})$. These coefficients, $\{ \Delta \mathbf{x}_j \}_{0 \le j < J}$, as well as the lowpass, $\mathbf{x}_J$, can then be further processed. A new image is constructed recursively from these processed coefficients, first by upsampling the lowest resolution and then by adding the corresponding details at each level until the initial scale $j=0$ is reached: $\mathbf{x}_j = B \star (2^\uparrow \mathbf{x}_{j+1}) + \Delta \mathbf{x}_j$.
  • Figure 4: Learnable quadratic prediction mechanism. Groups of coefficients ($\mathbf{y}_{k,t}$) at the previous and current time steps are normalized ($\mathbf{u}_{k,t}$) and then passed through a Linear-Square-Linear cascade to produce a prediction matrix ($\mathbf{M}_{k,t}$). This matrix is applied to the current vector of coefficients to predict the next one. The linear transforms ($\mathbf{L}_1$ and $\mathbf{L}_2$) are learned. This quadratic prediction module contains phase extrapolation as a special case and handles the more general case of groups of coefficients beyond pairs.
  • Figure 5: Example image sequence and predictions. A typical example image sequence from the DAVIS test set. The first three frames on the top row display the unprocessed images, and the last five frames show the prediction of each method. The bottom row displays error maps, computed as the difference between the target image and each predicted next frame at the corresponding position in the top row. All subfigures are shown on the same scale.
  • ...and 2 more figures
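
The Laplacian pyramid recursion in the Figure 3 caption can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the blur filter $B$ is assumed to be a 5-tap binomial kernel with circular boundary handling, the upsampling blur carries an assumed gain of 2 to compensate for the inserted zeros, and the function names (`blur`, `downsample`, `upsample`, `analyze`, `synthesize`) are hypothetical.

```python
import numpy as np

# Assumed blur filter B: 5-tap binomial kernel (the caption does not specify B).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(x, gain=1.0):
    """Circular convolution of a 1D signal with the symmetric kernel."""
    out = np.zeros_like(x, dtype=float)
    for offset, w in zip(range(-2, 3), KERNEL):
        out += gain * w * np.roll(x, offset)
    return out

def downsample(x):
    """Blur, then keep every second sample (stride of 2)."""
    return blur(x)[::2]

def upsample(x):
    """Insert one zero between samples, then blur (gain 2 is an assumption)."""
    up = np.zeros(2 * len(x))
    up[::2] = x
    return blur(up, gain=2.0)

def analyze(x, levels):
    """Return the detail coefficients for j = 0..levels-1 and the final lowpass."""
    details = []
    for _ in range(levels):
        low = downsample(x)
        details.append(x - upsample(low))   # detail_j = x_j - B * up(x_{j+1})
        x = low
    return details, x

def synthesize(details, lowpass):
    """Invert the pyramid: x_j = B * up(x_{j+1}) + detail_j, from coarse to fine."""
    x = lowpass
    for d in reversed(details):
        x = upsample(x) + d
    return x

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)            # length must be divisible by 2**levels
details, lowpass = analyze(signal, levels=3)
reconstruction = synthesize(details, lowpass)
print([len(d) for d in details], len(lowpass))
```

Because each detail band stores exactly what the upsampled coarser level misses, reconstruction is exact (up to round-off) for any choice of blur filter when the coefficients are left unprocessed. The choice of kernel only affects how signal energy is divided across scales.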