ProMotion: Prototypes As Motion Learners

Yawen Lu, Dongfang Liu, Qifan Wang, Cheng Han, Yiming Cui, Zhiwen Cao, Xueling Zhang, Yingjie Victor Chen, Heng Fan

TL;DR

This work tackles the fragmentation of motion learning by introducing ProMotion, a unified prototypical motion framework that treats motion as a set of prototypes learned through subspace-structured features. A hierarchical Transformer-based feature denoiser reduces noise and uncertainty, while a prototypical learner clusters denoised subspaces into motion prototypes that can power both optical flow and scene depth estimation. The approach yields strong results on the Sintel and KITTI benchmarks, including average reductions of 0.38 in average endpoint error (AEPE) and 0.13 in Abs Rel relative to specialized methods, and transfers to downstream 2D and 3D tasks. By unifying motion tasks under a single prototypical paradigm, ProMotion has the potential to catalyze the development of more universal, transfer-friendly vision models.

Abstract

In this work, we introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. We adopt a prototypical perspective, establishing a unified paradigm that harmonizes disparate motion learning approaches. This novel paradigm streamlines the architectural design, enabling the simultaneous assimilation of diverse motion information. We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion. This approach effectively circumvents the pitfalls of ambiguity in pixel-wise feature matching, significantly bolstering the robustness of motion representation. We demonstrate a profound degree of transferability across distinct motion patterns. This inherent versatility reverberates robustly across a comprehensive spectrum of both 2D and 3D downstream tasks. Empirical results demonstrate that ProMotion outperforms various well-known specialized architectures, achieving 0.54 and 0.054 Abs Rel error on the Sintel and KITTI depth datasets, 1.04 and 2.01 average endpoint error on the clean and final pass of Sintel flow benchmark, and 4.30 F1-all error on the KITTI flow benchmark. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
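
The three metrics quoted above follow standard benchmark definitions, and a small example may make the numbers easier to read. The sketch below is not the authors' evaluation code: the function names, the flat list-of-pixels input format, and the 3 px / 5% outlier thresholds (the usual KITTI convention) are assumptions made for illustration.

```python
import math

def average_endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    errors = [math.hypot(pu - gu, pv - gv)
              for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)]
    return sum(errors) / len(errors)

def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    ratios = [abs(p - g) / g for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(ratios) / len(ratios)

def f1_all(pred_flow, gt_flow, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of outlier pixels: endpoint error above abs_thresh pixels
    AND above rel_thresh times the ground-truth flow magnitude (assumed
    KITTI-style thresholds)."""
    outliers = 0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        epe = math.hypot(pu - gu, pv - gv)
        magnitude = math.hypot(gu, gv)
        if epe > abs_thresh and epe > rel_thresh * magnitude:
            outliers += 1
    return 100.0 * outliers / len(pred_flow)

# Toy usage: two pixels of flow (u, v) and two pixels of depth.
toy_pred_flow = [(1.0, 0.0), (0.0, 4.0)]
toy_gt_flow = [(1.0, 0.0), (0.0, 0.0)]
print(average_endpoint_error(toy_pred_flow, toy_gt_flow))
print(f1_all(toy_pred_flow, toy_gt_flow))
print(abs_rel([2.0, 9.0], [2.0, 10.0]))
```

Lower is better for all three quantities. AEPE and Abs Rel are averages over pixels, while the F1-all figure is a percentage of pixels counted as outliers.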

Paper Structure

This paper contains 11 sections, 7 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: ProMotion handles core motion tasks (i.e., optical flow and scene depth estimation) within a single prototype-based framework. This adaptability enables knowledge transfer to various downstream applications. For depth (bhat2021adabins, ranftl2021vision, P3Depth), we reduce $Abs\ Rel$ by 0.13 on average; for flow (xu2022gmflow, huang2022flowformer, dong2023rethinking), we reduce AEPE by 0.38 on average; for downstream tasks, we achieve average boosts of 3.6% $mAP$ in video object detection (chen20mega, zhou2022transvod, shi2023yolov), 3.4% $AP$ in video object segmentation (cheng2021mask2former, wu2022seqformer, huang2022minvis), and 7.3% $AP_{3D}$ in 3D object detection (li2020rtm3d, huang2022monodtr, monorun2021).
  • Figure 2: (a) Overall pipeline of ProMotion. (b) Each Transformer block in the feature denoiser maps the input tokens into different feature subspaces (Eq. \ref{denoising}) and then projects them to the orthogonal direction (Eq. \ref{projection}), thereby mitigating the uncertainties in motion for robustness. (c) The prototypical learner clusters the subspaces into prototypes (Eq. \ref{cluster}) and updates them iteratively (Eq. \ref{update}). The learned prototypes capture different motion patterns, enabling representation learning for various dynamic characteristics.
  • Figure 3: Qualitative comparison of optical flow on the Sintel and KITTI val sets. Notable areas are marked with red circles. Compared to huang2022flowformer and dong2023rethinking, our approach better reduces matching uncertainties caused by similar patterns, illumination changes, shadows, etc.
  • Figure 4: Qualitative comparison of scene depth on the Sintel and KITTI val sets. Notable areas are marked with red circles. Compared to bhat2021adabins and ranftl2021vision, our approach produces smoother, more consistent depths with complete object shapes and clear boundaries. Sparse ground truths in KITTI are interpolated for better visualization.
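
The cluster-then-update loop described in the Figure 2(c) caption can be illustrated with a generic soft-assignment clustering routine. The equations the caption refers to are not reproduced here, so the sketch below is only a stand-in under stated assumptions: dot-product similarity, a softmax over prototypes, a weighted-mean update, and the names `cluster_prototypes`, `num_iters`, and `temperature` are all hypothetical and should not be read as the paper's formulation.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_prototypes(tokens, prototypes, num_iters=3, temperature=0.1):
    """Soft k-means-style loop (an assumed stand-in, not the paper's equations).

    tokens:     list of N feature vectors (lists of floats)
    prototypes: list of K initial prototype vectors of the same dimension
    Returns the updated prototypes and the assignments computed in the
    last iteration (one K-way probability vector per token).
    """
    assign = []
    for _ in range(num_iters):
        # Cluster step: softly assign every token to every prototype.
        assign = [softmax([dot(t, p) / temperature for p in prototypes])
                  for t in tokens]
        # Update step: each prototype becomes the assignment-weighted
        # mean of the tokens.
        updated = []
        for k, proto in enumerate(prototypes):
            weight = sum(a[k] for a in assign)  # > 0 because softmax is positive
            updated.append([
                sum(a[k] * t[d] for a, t in zip(assign, tokens)) / weight
                for d in range(len(proto))
            ])
        prototypes = updated
    return prototypes, assign

# Toy usage: four 2-D tokens forming two groups, two initial prototypes.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_protos, toy_assign = cluster_prototypes(toy_tokens, [[1.0, 0.0], [0.0, 1.0]])
print(toy_protos)
print(toy_assign)
```

A lower `temperature` gives sharper, closer to hard assignments, while a higher value blends tokens across prototypes. In the setting the caption describes, the tokens would be the denoised subspace features and the resulting prototypes would feed the flow and depth predictions.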