Learning to Linearize Under Uncertainty
Ross Goroshin, Michael Mathieu, Yann LeCun
TL;DR
The paper tackles unsupervised learning of deep representations by training features that linearize short temporal transformations in unlabeled video, using frame prediction as the training signal. It introduces a two-part approach: a phase-pooling architecture that yields locally linearized magnitude and phase coordinates, and a latent-variable mechanism that handles inherent prediction uncertainty, combined with a curvature-regularized prediction loss in latent space. Key contributions include a concrete encoder-decoder framework with a loss $L$ that blends prediction accuracy and curvature minimization, a soft, differentiable phase-pooling operator, and an uncertainty-aware extension using latent variables $\delta$ that can be inferred per example. Experiments with a shallow architecture on natural data and on NORB-based sequences show improved linearization, more realistic reconstructions, and the ability to capture multiple plausible futures, suggesting potential for unsupervised, transferable feature learning in video domains.
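To make the idea of a loss that blends prediction accuracy with curvature minimization concrete, here is a minimal NumPy sketch. It assumes the next latent code is predicted by straight-line extrapolation from the two previous codes, and that curvature is measured as one minus the cosine similarity between successive latent displacements. The function names, the weight `lam`, and the exact form of each term are illustrative assumptions, not necessarily the paper's formulation.

```python
import numpy as np


def linear_extrapolate(z_prev, z_curr):
    """Continue the straight line through two latent codes (assumed predictor)."""
    return 2.0 * z_curr - z_prev


def curvature_penalty(z_prev, z_curr, z_next, eps=1e-8):
    """One minus the cosine similarity of successive latent displacements.

    Close to 0 when the three codes lie on a straight line traversed in one
    direction, and 1 when the trajectory turns by 90 degrees.
    """
    d1 = z_curr - z_prev
    d2 = z_next - z_curr
    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2) + eps)
    return 1.0 - cos


def combined_loss(x_next, x_pred, z_prev, z_curr, z_next, lam=0.1):
    """Squared prediction error plus a weighted curvature term.

    `lam` is a hypothetical trade-off weight chosen for illustration.
    """
    prediction_error = 0.5 * np.sum((x_pred - x_next) ** 2)
    return prediction_error + lam * curvature_penalty(z_prev, z_curr, z_next)


# Usage: three latent codes on a straight line give near-zero curvature.
z0, z1, z2 = np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 4.0])
z_hat = linear_extrapolate(z0, z1)
straight = curvature_penalty(z0, z1, z2)
```

In this sketch a perfectly linear latent trajectory makes the extrapolated code coincide with the true next code, so minimizing the curvature term is what makes linear extrapolation a reasonable predictor.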
Abstract
Training deep feature hierarchies to solve supervised learning tasks has achieved state-of-the-art performance on many problems in computer vision. However, a principled way in which to train such hierarchies in the unsupervised setting has remained elusive. In this work we suggest a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences. This is done by training a generative model to predict video frames. We also address the problem of inherent uncertainty in prediction by introducing latent variables that are non-deterministic functions of the input into the network architecture.
