Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows

Simone Marullo, Matteo Tiezzi, Marco Gori, Stefano Melacci

TL;DR

This paper investigates the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations, and introduces a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity.

Abstract

Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper, we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Unlike existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstraction, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing it to pre-trained state-of-the-art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.

Paper Structure (13 sections, 13 equations, 8 figures, 2 tables, 1 algorithm)

This paper contains 13 sections, 13 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: The architecture of CMOSFET. Given a pair of consecutive frames $I_{t-1}$ and $I_{t}$, pixel-wise features $\bigl(f_{t-1}^{\ell}, f_{t}^{\ell}\bigr)$ and motion flow $\delta_{t}^{\ell}$ are extracted at multiple levels, indexed by $\ell$.
  • Figure 2: Illustration of the three $\mathcal{L}_c$ terms ($i.$, $ii.$, $iii.$) in Eq. \ref{eq:megaconj}. Each of the three sub-pictures includes portions of the architecture of Fig. \ref{fig:toy}, while connections indicate the dependency introduced by the $\mathcal{L}_c$ term. Case $i.$ involves a single level, while $ii.$ and $iii.$ introduce cross-level dependencies.
  • Figure 3: Self-supervised loss, considering a pair of consecutive frames ($I_{t-1}$, $I_{t}$), at the first level ($\ell=1$) of the feature hierarchy (the same holds for any $\ell$). The objective $L_{self}^1$ (Eq. \ref{eq:ssl}) is the sum of two contrastive penalties computed through $L_{\dagger}$. The first contribution (left) encourages the development of features by comparing features extracted on the same frame ($I_{t-1}$), while the second term (right) encourages alignment between features extracted in a pair of frames ($I_{t-1}$, $I_{t}$), thanks to flow matching (warping coordinates of items on one side of similarity/dissimilarity relationships, e.g. $A\rightarrow A'$, $B\rightarrow B'$, $C\rightarrow C'$). Green (red) links connect pixels whose features are enforced to be similar (different) according to our motion-based criterion.
  • Figure 4: Illustration of different sampling strategies ($\ell=1$). Orange dots (bottom row) are points sampled from a visual stream (top row) in which a chair is moving against a static background that includes three smaller objects (pillow, laptop, teapot). First row: frame, estimated flow (different colors denote different directions), winning feature (different colors denote different winning features). Second row, left-to-right: plain uniform sampling, motion-biased sampling, and the proposed sampling driven by both motion and winning features. In the last case, the sampled coordinates cover both the moving chair and other details of the image in a balanced manner, while the first and second cases give more emphasis to the uniform region (first, second) or the moving object (second).
  • Figure 5: Learning over time with CMOSFET. Exponential Moving Average (EMA) networks and gradient-updated networks (GRA) extract features from $I_{t}$ and $I_{t-1}$, respectively. White-dotted parts of the model are in addition to the ones of Fig. \ref{fig:toy}, and the diagonal bars $\bigl(//\bigr)$ indicate that no gradients are propagated on that path. The loss function $L$ of Eq. \ref{eq:losscum} drives the learning process of the motion predictors and of the feature extractors, enforcing GRA to be coherent with EMA and instantiating the motion-induced contrastive criterion.
  • ...and 3 more figures
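
The Figure 5 caption pairs gradient-updated networks (GRA) with Exponential Moving Average (EMA) networks that receive no gradients. As a rough illustration of what an EMA copy generally looks like, the sketch below applies the textbook update `ema = momentum * ema + (1 - momentum) * gra` to plain Python lists. The function name `ema_update`, the `momentum` value, and the dictionary layout are all hypothetical choices made for this example; the page does not state the paper's actual update rule or hyperparameters.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(ema: Params, gra: Params, momentum: float = 0.99) -> Params:
    """Return new EMA parameters: momentum * ema + (1 - momentum) * gra.

    Assumption: this is the generic EMA rule, not necessarily the exact
    rule used by CMOSFET. No gradient flows through this step; it is a
    plain arithmetic blend of two parameter sets.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if ema.keys() != gra.keys():
        raise ValueError("parameter names differ")
    updated: Params = {}
    for name, ema_values in ema.items():
        gra_values = gra[name]
        if len(ema_values) != len(gra_values):
            raise ValueError(f"size mismatch for {name!r}")
        updated[name] = [
            momentum * e + (1.0 - momentum) * g
            for e, g in zip(ema_values, gra_values)
        ]
    return updated


# Toy usage: the EMA copy moves halfway toward the gradient-updated weights.
ema_params = {"w": [0.0, 1.0]}
gra_params = {"w": [1.0, 1.0]}
ema_params = ema_update(ema_params, gra_params, momentum=0.5)
```

With a momentum close to 1 the EMA copy changes slowly, which is the usual reason such a copy is used as a stable target for the gradient-updated network to stay coherent with.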