STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing

Andrea Alfarano, Alberto Alfarano, Linda Friso, Andrea Bacciu, Irene Amerini, Fabrizio Silvestri

TL;DR

STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together, using a single convolution to mix both types of features into a comprehensive spatiotemporal patch representation.

Abstract

Spatio-temporal predictive learning (STL) is a self-supervised learning paradigm that enables models to identify spatial and temporal patterns by predicting future frames based on past frames. Traditional methods, which use recurrent neural networks to capture temporal patterns, have proven effective but come with high system complexity and computational demand. Convolutions could offer a more efficient alternative, but they are limited in two ways: they treat all previous frames equally, which results in poor temporal characterization, and their local receptive field limits the capacity to capture distant correlations among frames. In this paper, we propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers. STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together, using a single convolution to mix both types of features into a comprehensive spatio-temporal patch representation. This representation is then processed in a purely convolutional framework that can focus simultaneously on the interactions among near and distant patches, and that allows efficient reconstruction of the predicted frames. Our architecture achieves state-of-the-art performance on STL benchmarks across different datasets and settings, while significantly improving computational efficiency in terms of parameters and FLOPs. The code is publicly available.
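
To make the "rearrange, then mix with a single convolution" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes non-overlapping patches (kernel size equal to stride equal to $p$), in which case the convolution reduces to one linear projection per patch. The function name `encode_patches` and all tensor sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def encode_patches(frames, weight, patch):
    """Fold time into channels, then project each p x p patch to d features.

    frames: (B, T, C, H, W) input sequence
    weight: (d, T*C, patch, patch) kernel of the single convolution
    returns: (B, d, H // patch, W // patch)

    Assumption: kernel size == stride == patch (non-overlapping patches).
    """
    b, t, c, h, w = frames.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    assert weight.shape[1:] == (t * c, patch, patch)

    # Stack the T frames along the channel axis: (B, T*C, H, W).
    x = frames.reshape(b, t * c, h, w)

    # Cut the image into non-overlapping patches: (B, T*C, H/p, p, W/p, p).
    x = x.reshape(b, t * c, h // patch, patch, w // patch, patch)

    # One contraction over (channel, row-in-patch, col-in-patch) mixes
    # spatial and temporal information in a single step.
    return np.einsum("bkhiwj,dkij->bdhw", x, weight)

rng = np.random.default_rng(0)
frames = rng.standard_normal((2, 10, 1, 64, 64))  # 10 grayscale 64x64 frames
weight = rng.standard_normal((32, 10, 4, 4))      # d = 32, p = 4
z = encode_patches(frames, weight, patch=4)
print(z.shape)  # (2, 32, 16, 16)
```

Because the time axis is folded into the input channels, every output feature depends on all input frames at once, and the kernel holds a separate weight for each frame, so frames are not weighted identically by construction.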

Paper Structure

This paper contains 27 sections, 3 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: MSE vs Number of parameters for existing STL models and STLight on Moving MNIST dataset trained and evaluated under the same settings.
  • Figure 2: STLight model workflow. We rearrange the input sequence of frames along the channel dimension (1), and through a single convolutional layer, we encode the sequence into patches of size $p \times p$ with hidden temporal dimension $d$, containing both spatial and temporal information (2). The patches are processed through a custom STLMixer block repeated $\texttt{de}$ times (3). Each block processes the relationships between near (a) and distant (b) patches along the spatial dimension, as well as the intra-patch relationships on the temporal dimension (c). We decode the output sequence, restoring the initial spatial resolution through a patch shuffle (4) and the temporal resolution by reassembling the patches into the final output sequence (5).
  • Figure 3: Qualitative results on Moving MNIST and TaxiBJ datasets.
  • Figure 4: STLight models trained on KITTI (0.1M-15M parameters) outperform baselines on Caltech, demonstrating strong cross-dataset generalization with more efficient resource utilization.
  • Figure 5: Learning curve comparison between state-of-the-art methods and ours.
  • ...and 5 more figures
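
Steps (4) and (5) of the Figure 2 caption can be sketched as a pure rearrangement. The NumPy example below assumes that the patch shuffle behaves like the common pixel-shuffle (depth-to-space) operation and that the decoder input already has $T \cdot C \cdot p^2$ channels. Both are assumptions made for illustration, not details confirmed by the text above, and `patch_shuffle` is a hypothetical name.

```python
import numpy as np

def patch_shuffle(z, t, c, patch):
    """Depth-to-space rearrangement followed by splitting channels into frames.

    z: (B, t*c*patch*patch, Hp, Wp) patch-level features
    returns: (B, t, c, Hp*patch, Wp*patch) output sequence

    Assumption: channel index = (frame_channel * patch + row) * patch + col,
    i.e. the usual pixel-shuffle layout.
    """
    b, ch, hp, wp = z.shape
    assert ch == t * c * patch * patch, "channel count must equal t*c*patch^2"

    # Separate the per-patch pixel offsets from the channel axis.
    x = z.reshape(b, t * c, patch, patch, hp, wp)

    # Interleave offsets with the patch grid: (B, T*C, Hp, p, Wp, p).
    x = x.transpose(0, 1, 4, 2, 5, 3)

    # (4) restore spatial resolution.
    x = x.reshape(b, t * c, hp * patch, wp * patch)

    # (5) restore temporal resolution by unfolding channels into frames.
    return x.reshape(b, t, c, hp * patch, wp * patch)

z = np.arange(2 * 160 * 16 * 16, dtype=np.float64).reshape(2, 160, 16, 16)
frames = patch_shuffle(z, t=10, c=1, patch=4)
print(frames.shape)  # (2, 10, 1, 64, 64)
```

The rearrangement has no learnable parameters, so under these assumptions the cost of restoring full resolution is only a memory reshuffle.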