FlowFeat: Pixel-Dense Embedding of Motion Profiles

Nikita Araslanov, Anna Sonnweber, Daniel Cremers

TL;DR

FlowFeat tackles the low spatial resolution of modern feature maps by embedding motion profiles into a pixel-level representation, learned via self-supervised distillation from optical flow. It learns a distribution of linear mappings that align the features with apparent motion: the optimal mapping $A^*$ is obtained by ridge regression in the teacher branch, and the student is trained with the loss $L_{total} = L_{grad} + \lambda L_1$, whose gradient term helps preserve motion boundaries. The method improves video object segmentation, semantic segmentation, and monocular depth estimation across multiple backbones and data scales, while remaining efficient and robust to flow quality. The results indicate that motion-aware, high-resolution representations can complement standard backbone features for dense prediction without labeled data, with potential extensions to 3D reconstruction and tracking.

Abstract

Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: FlowFeat is a versatile feature representation at pixel-level resolution. Embedding profiles of plausible motion, FlowFeat stands out from existing techniques by offering excellent spatial precision coupled with temporal consistency. Here, we visualise (using PCA with three principal components) a comparison of FlowFeat with the feature maps of the state-of-the-art vision encoders.
  • Figure 2: Embedding motion profiles: FlowFeat relies on an exponential moving average (EMA) teacher model and learns to reconstruct apparent motion with a distribution of linear transformations. For a given frame $I_t$, we randomly sample its temporal counterpart $I_{t^\prime}$. A pre-trained network $\mathcal{F}$ computes optical flow $F_{(t \rightarrow t^\prime)}$. We generate two overlapping random crops of frame $I_t$ and feed the resulting views $v_1$ and $v_2$ to the teacher and the student networks, respectively. Obtaining the optimal linear transform $A^\ast$ on-the-fly with ridge regression in the teacher branch, we compute the reconstruction loss w.r.t. the flow crop $u_2$ to update the student parameters $\theta$ with gradient descent.
  • Figure 3: Left: Focal gradient matching term $\mathcal{L}_\nabla$. The first row visualises the first three PCA components of FlowFeat trained with and without the gradient term. Observe sharper feature boundaries with the use of the gradient term. Additionally, we found benefit in modulating the gradient difference with a hyperparameter $\sigma$, as defined in \ref{eq:edge}. The modulation with a lower $\sigma$ amplifies the effect of motion discontinuities (here, demonstrated for image gradients). Right: Qualitative examples on VOS. FlowFeat reveals finer details of the semantic masks compared to existing upsampling strategies, such as FeatUp [Fu:2024:FeatUp].
  • Figure 4: Depth probing. FlowFeat significantly improves depth estimates for challenging elements, such as non-Lambertian surfaces (e.g. left, the piano), intricate structures (e.g. middle, the bicycle), and under- and oversaturated image areas (e.g. right, a bathroom).
  • Figure 5: Semantic segmentation and post-hoc refinement (++) with FlowFeat. The segmentation masks from FlowFeat exhibit a high level of boundary accuracy. The FlowFeat representation, visualised with PCA, identifies prominent scene elements in fine-grained detail. A lightweight post-hoc refinement (FlowFeat-K++), based on PAMR [Araslanov:2020:SSS], leverages the pairwise pixel similarity embedded by FlowFeat (instead of image intensities) to improve the results further.
  • ...and 1 more figure
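
The training step described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`ridge_fit`, `l1_reconstruction_loss`, `ema_update`), the regularisation strength `alpha`, the EMA `momentum`, and the synthetic features and flow are all assumptions made for illustration. The sketch fits a linear map $A^\ast$ from teacher features to flow in closed form, measures an L1 reconstruction error of the student features mapped through $A^\ast$ against the second flow crop, and shows a generic EMA parameter update. The gradient matching term $\mathcal{L}_\nabla$ is omitted because its definition is not given here.

```python
import numpy as np


def ridge_fit(features, flow, alpha=1e-3):
    """Closed-form ridge regression: A* = (X^T X + alpha * I)^-1 X^T U.

    features: (N, D) per-pixel features, flow: (N, 2) per-pixel flow vectors.
    alpha is an assumed regularisation strength, not a value from the paper.
    """
    dim = features.shape[1]
    gram = features.T @ features + alpha * np.eye(dim)
    return np.linalg.solve(gram, features.T @ flow)


def l1_reconstruction_loss(features, linear_map, flow):
    """Mean absolute error between the linearly mapped features and the flow."""
    return np.abs(features @ linear_map - flow).mean()


def ema_update(teacher, student, momentum=0.99):
    """Generic EMA update: teacher <- m * teacher + (1 - m) * student."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


# Toy data (assumption): the flow is an exact linear function of the features.
rng = np.random.default_rng(0)
n_pixels, feat_dim = 512, 16
true_map = rng.normal(size=(feat_dim, 2))

# Teacher branch: features of view v1 and the matching flow crop u1.
teacher_feats = rng.normal(size=(n_pixels, feat_dim))
flow_u1 = teacher_feats @ true_map
a_star = ridge_fit(teacher_feats, flow_u1)

# Student branch: features of view v2 reconstruct the flow crop u2 via A*.
student_feats = rng.normal(size=(n_pixels, feat_dim))
flow_u2 = student_feats @ true_map
loss = l1_reconstruction_loss(student_feats, a_star, flow_u2)

# EMA step on hypothetical parameter dictionaries.
teacher_params = {"w": np.zeros(3)}
student_params = {"w": np.ones(3)}
teacher_params = ema_update(teacher_params, student_params, momentum=0.9)
```

In an actual training loop the loss would be differentiated with respect to the student parameters $\theta$ only, while $A^\ast$ is recomputed for every sample; solving the regularised normal equations keeps that per-sample fit cheap because the system is only $D \times D$.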