Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics

Christopher Hoang, Mengye Ren

TL;DR

Midway Network addresses a gap in self-supervised learning, where prior methods learn representations for either object recognition or motion understanding but not both, by learning the two jointly from natural videos through latent dynamics modeling. It introduces a midway top-down path to infer motion latents, a dense multi-level forward-prediction objective, and hierarchical backward refinement to handle complex, multi-object scenes. After pretraining on two large-scale natural video datasets, it achieves strong results on semantic segmentation and optical flow relative to prior self-supervised methods, and a forward feature perturbation analysis shows that the learned dynamics capture high-level correspondence. The work unifies recognition and motion in a single framework and suggests potential for real-world planning with future extensions to action data.

Abstract

Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
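
To make the idea of a dense forward prediction objective concrete, the following is a minimal single-level sketch, not the authors' implementation. It assumes toy linear maps in place of the paper's network blocks, a mean squared error in place of the actual prediction loss, and hypothetical names (`infer_motion_latent`, `forward_predict`, `dense_prediction_loss`, `w_m`, `w_f`) that do not come from the paper. It only illustrates the data flow the abstract describes: a motion latent is inferred from a pair of frames, and the target frame's dense features are predicted from the source features conditioned on that latent.

```python
import numpy as np

def infer_motion_latent(z_src, z_tgt, w_m):
    """Toy stand-in for the midway path: infer a motion latent per location
    from source and target features (assumed linear map + tanh)."""
    pair = np.concatenate([z_src, z_tgt], axis=-1)  # (N, 2D)
    return np.tanh(pair @ w_m)                      # (N, M)

def forward_predict(z_src, m, w_f):
    """Toy stand-in for forward prediction: predict target features from
    source features conditioned on the motion latent (assumed linear map)."""
    inp = np.concatenate([z_src, m], axis=-1)       # (N, D + M)
    return inp @ w_f                                # (N, D)

def dense_prediction_loss(z_pred, z_tgt):
    """Assumed loss: squared error averaged over every location and channel,
    so each spatial position contributes to the objective."""
    return float(np.mean((z_pred - z_tgt) ** 2))

# Hypothetical sizes: N spatial locations, D feature channels, M latent channels.
rng = np.random.default_rng(0)
N, D, M = 16, 8, 4
z_t = rng.normal(size=(N, D))    # dense features of the source frame
z_t1 = rng.normal(size=(N, D))   # dense features of the target frame
w_m = rng.normal(size=(2 * D, M)) * 0.1
w_f = rng.normal(size=(D + M, D)) * 0.1

m = infer_motion_latent(z_t, z_t1, w_m)
z_pred = forward_predict(z_t, m, w_f)
loss = dense_prediction_loss(z_pred, z_t1)
```

In the paper this objective is applied at multiple levels of a hierarchy and jointly trains all components; the sketch covers only one level and omits training entirely.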

Paper Structure

This paper contains 35 sections, 3 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: (a) Traditional SSL methods focus on learning representations for object recognition and lean on curated, iconic image data for training. (b) Dense SSL methods extend training to natural videos, but either do not utilize motion transformations (gordon:2020:VINCE; venkataramanan:2024:imagenet_dora) for training or rely on external networks to incorporate motion (xiong:2021:flowe; wang:2025:poodle). (c) Our proposed Midway Network jointly learns representations of semantics and motion from solely natural videos via latent dynamics modeling. The learned image-level representations can be used towards downstream object recognition and motion understanding tasks.
  • Figure 2: Midway Network employs a hierarchical design in which the midway path infers motion latents $m$ between source and target features in a top-down manner. Within each level of this hierarchy, backward layers with top-down and lateral connections refine the source features $z_t^l$. Forward prediction blocks, conditioned on the refined features $v_t^l$ and motion latents $m^{l+1}$, predict the dense target features $z_{t+1}^l$, and the prediction loss $\mathcal{L}_{dyn}$ jointly trains all components at each level.
  • Figure 3: Attention layer with gating unit on $v_t$.
  • Figure 4: Visualization of BDD semantic segmentation UperNet readout. Midway Network is able to produce cleaner object boundaries, particularly for the cyclist on the right.
  • Figure 5: Visualization of FlyingThings and MPI-Sintel optical flow evaluations after finetuning. Midway Network is able to generate more accurate optical flow predictions compared to CroCo v2.
  • ...and 10 more figures