Learning predictable and robust neural representations by straightening image sequences
Xueyan Niu, Cristina Savin, Eero P. Simoncelli
TL;DR
The paper investigates learning predictive neural representations by promoting straightened temporal trajectories in video-derived embeddings. It introduces a parameter-free straightening objective combined with whitening regularizers, and demonstrates its effectiveness on synthetic sequential data, yielding representations that preserve dynamic attributes and enable accurate linear extrapolation. The authors show that straightened representations are more robust to noise and adversarial perturbations than invariance-based SSL methods, and that straightening can boost robustness when used as a regularizer with other SSL objectives. They also provide geometric insight into how straightening shapes trajectory structure to facilitate class separability, and discuss extensions to multi-timescale and hierarchical prediction. Overall, straightening emerges as a practical, robust principle for self-supervised learning from temporal visual inputs with broad applicability to other models and data domains.
Abstract
Prediction is a fundamental capability of all living organisms, and has been proposed as an objective for learning sensory representations. Recent work demonstrates that in primate visual systems, prediction is facilitated by neural representations that follow straighter temporal trajectories than their initial photoreceptor encoding, which allows for prediction by linear extrapolation. Inspired by these experimental findings, we develop a self-supervised learning (SSL) objective that explicitly quantifies and promotes straightening. We demonstrate the power of this objective in training deep feedforward neural networks on smoothly rendered synthetic image sequences that mimic commonly occurring properties of natural videos. The learned model contains neural embeddings that are not only predictive but also factorize the geometric, photometric, and semantic attributes of objects. The representations also prove more robust to noise and adversarial attacks than previous SSL methods that optimize for invariance to random augmentations. Moreover, these beneficial properties can be transferred to other training procedures by using the straightening objective as a regularizer, suggesting a broader utility for straightening as a principle for robust unsupervised learning.
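To make the notions of straightening and linear extrapolation concrete, the following is a minimal sketch and not the paper's actual objective. It assumes that straightness can be scored as the mean cosine similarity between consecutive displacement vectors of an embedding trajectory, and that the next embedding is predicted by continuing the most recent displacement. The function names `straightness` and `extrapolate` and the toy trajectories are hypothetical, introduced only for illustration.

```python
import numpy as np

def straightness(z):
    """Mean cosine similarity between consecutive displacement vectors.

    Assumed illustrative measure, not necessarily the paper's loss.
    z: array of shape (T, D), one embedding per frame, with T >= 3.
    Values near 1 indicate a nearly straight trajectory.
    """
    z = np.asarray(z, dtype=float)
    d = np.diff(z, axis=0)                 # (T-1, D) frame-to-frame displacements
    norms = np.linalg.norm(d, axis=1)      # length of each displacement
    eps = 1e-12                            # guards against division by zero
    cos = np.sum(d[:-1] * d[1:], axis=1) / (norms[:-1] * norms[1:] + eps)
    return float(np.mean(cos))

def extrapolate(z_prev, z_curr):
    """Linear extrapolation: continue the last displacement one more step."""
    return 2.0 * np.asarray(z_curr, dtype=float) - np.asarray(z_prev, dtype=float)

# Toy 2-D trajectories: a straight line and a half-circle arc.
t = np.linspace(0.0, 1.0, 6)
line = np.stack([t, 2.0 * t], axis=1)
arc = np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=1)

print("straightness(line):", straightness(line))
print("straightness(arc): ", straightness(arc))
print("prediction error on line:", np.linalg.norm(extrapolate(line[3], line[4]) - line[5]))
print("prediction error on arc: ", np.linalg.norm(extrapolate(arc[3], arc[4]) - arc[5]))
```

On the straight trajectory the extrapolated point coincides with the true next point, while on the curved trajectory it does not. This is the intuition behind promoting straight trajectories in the learned representation: the straighter the path, the more accurate a simple linear predictor becomes.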
