Prototypical Transformer as Unified Motion Learners

Cheng Han; Yawen Lu; Guohao Sun; James C. Liang; Zhiwen Cao; Qifan Wang; Qiang Guan; Sohail A. Dianat; Raghuveer M. Rao; Tong Geng; Zhiqiang Tao; Dongfang Liu

TL;DR

ProtoFormer introduces a unified motion-learning framework by embedding prototype learning into Transformer attention. It replaces standard self-attention with Cross-Attention Prototyping (EM clustering) and couples it with Latent Synchronization to align prototype representations with feature maps, addressing motion uncertainty. Empirical results demonstrate competitive optical-flow and depth-estimation performance and show potential for generalization to downstream tasks and improved interpretability through prototype visualization. This work offers a principled approach to unify diverse motion tasks under a single, transparent architecture with robustness benefits.

Abstract

In this work, we introduce the Prototypical Transformer (ProtoFormer), a general and unified framework that approaches various motion tasks from a prototype perspective. ProtoFormer seamlessly integrates prototype learning with Transformer by thoughtfully considering motion dynamics, introducing two innovative designs. First, Cross-Attention Prototyping discovers prototypes based on signature motion patterns, providing transparency in understanding motion scenes. Second, Latent Synchronization guides feature representation learning via prototypes, effectively mitigating the problem of motion uncertainty. Empirical results demonstrate that our approach achieves competitive performance on popular motion tasks such as optical flow and scene depth. Furthermore, it exhibits generality across various downstream tasks, including object tracking and video stabilization.
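The Cross-Attention Prototyping idea (prototypes refined by EM clustering over features) can be illustrated with a minimal NumPy sketch. This is a generic soft-clustering loop under stated assumptions, not the paper's implementation: the function name `em_prototype_clustering`, the parameters `num_prototypes`, `num_iters`, and `temperature`, the dot-product similarity, and the random initialization are all hypothetical choices made for illustration.

```python
import numpy as np


def em_prototype_clustering(features, num_prototypes=4, num_iters=3,
                            temperature=1.0, seed=0):
    """Soft EM-style clustering of feature vectors into prototypes.

    features: array of shape (n, d), one row per pixel/token feature.
    Returns (prototypes, assign) with shapes (k, d) and (n, k).
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, _ = features.shape

    # Assumed initialization: use k randomly chosen features as prototypes.
    idx = rng.choice(n, size=num_prototypes, replace=False)
    prototypes = features[idx].copy()

    for _ in range(num_iters):
        # E-step: soft-assign each feature to prototypes via a softmax
        # over prototype similarities (cross-attention with prototypes
        # acting as queries and features as keys).
        logits = features @ prototypes.T / temperature      # (n, k)
        logits -= logits.max(axis=1, keepdims=True)          # stability
        assign = np.exp(logits)
        assign /= assign.sum(axis=1, keepdims=True)          # rows sum to 1

        # M-step: each prototype becomes the assignment-weighted mean
        # of the features (features acting as values).
        weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
        prototypes = weights.T @ features                    # (k, d)

    # `assign` comes from the last E-step, i.e. it was computed against
    # the prototypes before their final M-step update.
    return prototypes, assign


if __name__ == "__main__":
    feats = np.random.default_rng(1).normal(size=(50, 8))
    protos, assign = em_prototype_clustering(feats)
    print(protos.shape, assign.shape)
```

Normalizing the E-step over prototypes rather than over features is what makes this a clustering step: every feature distributes one unit of responsibility across the prototypes, and the M-step then averages features per prototype.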

Paper Structure (20 sections, 1 theorem, 14 equations, 10 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Methodology
Preliminary
ProtoFormer
Cross-Attention Prototyping via EM clustering
Prototype-Feature Corresponding by Latent Synchronization
Implementation Details
Experiments
Experiments on Optical Flow
Experiments on Scene Depth
Diagnostic Experiments
Conclusion
Training and Testing Configuration
More Qualitative Results
...and 5 more sections

Key Result

Theorem 1

For $\gamma > 0$ with $0 \leq \gamma \leq \lambda$, suppose the function $U(\cdot|\hat{\theta})$ is $\lambda$-strongly concave and FOS($\gamma$) holds on $\mathcal{B}_{2}(r;\hat{\theta})$. Then the EM operator $M$ is contractive over $\mathcal{B}_{2}(r;\hat{\theta})$ for all $\theta \in \mathcal{B}_{2}(r;\hat{\theta})$. Intuitively, we can deduce that for any initial point ...
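The statement above does not display the contraction bound itself. As a hedged reference point, the classical population-EM result of Balakrishnan, Wainwright, and Yu, whose hypotheses (strong concavity and first-order stability on a ball) this theorem mirrors, takes the form below. That the paper states exactly this inequality is an assumption, and the bound is a genuine contraction only under the strict condition $\gamma < \lambda$.

```latex
% Assumed form of the bound, following the classical EM contraction result.
\[
  \| M(\theta) - \hat{\theta} \|_{2}
  \;\le\; \frac{\gamma}{\lambda}\, \| \theta - \hat{\theta} \|_{2}
  \qquad \text{for all } \theta \in \mathcal{B}_{2}(r;\hat{\theta}).
\]
% Iterating \theta^{t+1} = M(\theta^{t}) from any \theta^{0} in the ball
% keeps the iterates in the ball and gives linear convergence:
\[
  \| \theta^{t} - \hat{\theta} \|_{2}
  \;\le\; \Bigl(\frac{\gamma}{\lambda}\Bigr)^{t} \| \theta^{0} - \hat{\theta} \|_{2}.
\]
```

When $\gamma = \lambda$ the factor equals one and the inequality only guarantees that iterates do not move away from $\hat{\theta}$.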

Figures (10)

Figure 1: ProtoFormer, as a unified framework, treats motion at different levels of dynamics granularity (e.g., instance-driven flow, pixel-anchored depth, etc.).
Figure 2: (a) Overall pipeline of ProtoFormer (see the ProtoFormer section). The movement of a small part of an object within an image is treated as rigid motion, and prototypes are used to understand or predict this kind of motion pattern. (b) In each layer of Cross-Attention Prototyping (see the section on Cross-Attention Prototyping via EM clustering), there are $N$ sequential iterations, each consisting of the feature-prototype assignment ($E$-step) and the subsequent update of the prototypes ($M$-step) via the recurrent update equation. (c) Concurrently, the Latent Synchronization process (see the section on Prototype-Feature Corresponding by Latent Synchronization) associates the feature representations via the freshly updated motion prototypes (see the latent synchronization equation). For (b) and (c), optical flow is used for illustration, which demonstrates straightforward systemic explainability. More visualization results are shown in the Diagnostic Experiments section.
Figure 3: Qualitative results on Sintel. The red boxes highlight the compared regions. MatchFlow (Dong et al., 2023) appears blurry and ambiguous on textureless and occluded objects, while FlowFormer (Huang et al., 2022) fails to recover complete and detailed information. Ours estimates clear and complete flow that is closer to the ground truth.
Figure 4: Qualitative depth comparison on KITTI. The red boxes indicate the highlighted regions. P3Depth (Patil et al., 2022) and AdaBins (Bhat et al., 2021) have limited receptive fields and do not consider conceptual object-level groupings, and thus produce discontinuous and ambiguous predictions. In contrast, ours estimates consistent and sharp depths that are closer to the ground truth.
Figure 5: Visualization of proto-feature mapping, which demonstrates distinct prototypes with similar representations.
...and 5 more figures

Theorems & Definitions (2)

Theorem 1
proof
