FlowFeat: Pixel-Dense Embedding of Motion Profiles
Nikita Araslanov, Anna Sonnweber, Daniel Cremers
TL;DR
FlowFeat addresses the challenge of producing dense, high-resolution feature maps by embedding motion profiles into a pixel-level representation learned via self-supervised distillation from optical-flow signals. It learns a distribution of linear mappings $A^*$ that align the features with motion, using ridge regression against a teacher network, and is optimized with the loss $L_{total} = L_{grad} + \lambda L_1$ to preserve motion boundaries. The method yields strong improvements on video object segmentation, semantic segmentation, and monocular depth estimation across multiple backbones and data scales, while remaining efficient and robust to flow quality. The results demonstrate that motion-aware, high-resolution representations can complement traditional backbone features to improve dense predictions without labeled data, enabling scalable, reliable dense image understanding and potential extensions to 3D reconstruction and tracking.
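To make the ridge-regression step concrete, the following is a minimal sketch of fitting a single linear mapping from per-pixel features to teacher flow vectors in closed form. It is an illustration under stated assumptions, not the authors' implementation: the function name `fit_linear_flow_map`, the `(N, D)` feature layout, the regularization weight `alpha`, and the synthetic data are all hypothetical, and the sketch does not cover the training loss $L_{total}$.

```python
import numpy as np


def fit_linear_flow_map(features, flow, alpha=1e-3):
    """Closed-form ridge regression from per-pixel features to flow vectors.

    features: (N, D) array, one D-dimensional feature per pixel (assumed layout).
    flow:     (N, 2) array, teacher optical flow (u, v) per pixel.
    Returns A of shape (D, 2) minimizing
        ||features @ A - flow||^2 + alpha * ||A||^2.
    """
    d = features.shape[1]
    # Regularized Gram matrix; alpha > 0 keeps it well conditioned.
    gram = features.T @ features + alpha * np.eye(d)
    # Solve the normal equations rather than forming an explicit inverse.
    return np.linalg.solve(gram, features.T @ flow)


# Synthetic check: flow generated by a known linear map should be recovered.
rng = np.random.default_rng(0)
features = rng.normal(size=(1024, 16))   # hypothetical pixel features
true_map = rng.normal(size=(16, 2))      # hypothetical ground-truth mapping
flow = features @ true_map               # stand-in for teacher flow

A_star = fit_linear_flow_map(features, flow, alpha=1e-6)
residual = np.abs(features @ A_star - flow).max()
```

With a small `alpha` and noise-free synthetic flow, `A_star` should closely match `true_map`; on real data the regularizer trades fit quality for stability when features are correlated.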
Abstract
Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.
