CCMR: High Resolution Optical Flow Estimation via Coarse-to-Fine Context-Guided Motion Reasoning

Azin Jahedi, Maximilian Luz, Marc Rivinius, Andrés Bruhn

TL;DR

CCMR tackles high-resolution optical flow estimation by unifying coarse-to-fine multi-scale processing with global-context-guided motion reasoning. It introduces a two-step approach: first, compute global context features with cross-covariance attention (XCiT) to capture scale-specific global information; then, guide motion grouping at each scale using cross-attention between the global context and the motion features, all within a four-scale coarse-to-fine framework. An improved feature consolidation unit and a lean context encoder reduce model size while preserving expressiveness, enabling efficient high-resolution reasoning. Empirical results show strong improvements in both occluded and non-occluded regions, with state-of-the-art results on KITTI 2015 and competitive performance on MPI Sintel, along with favorable generalization after pre-training and better memory efficiency than prior methods.

Abstract

Attention-based motion aggregation concepts have recently shown their usefulness in optical flow estimation, in particular when it comes to handling occluded regions. However, due to their complexity, such concepts have been mainly restricted to coarse-resolution single-scale approaches that fail to provide the detailed outcome of high-resolution multi-scale networks. In this paper, we hence propose CCMR: a high-resolution coarse-to-fine approach that leverages attention-based motion grouping concepts for multi-scale optical flow estimation. CCMR relies on a hierarchical two-step attention-based context-motion grouping strategy that first computes global multi-scale context features and then uses them to guide the actual motion grouping. As we iterate both steps over all coarse-to-fine scales, we adapt cross-covariance image transformers to allow for an efficient realization while maintaining scale-dependent properties. Experiments and ablations demonstrate that our efforts of combining multi-scale and attention-based concepts pay off. By providing highly detailed flow fields with strong improvements in both occluded and non-occluded regions, our CCMR approach not only outperforms both the corresponding single-scale attention-based and multi-scale attention-free baselines by up to 23.0% and 21.6%, respectively, but also achieves state-of-the-art results, ranking first on KITTI 2015 and second on MPI Sintel Clean and Final. Code and trained models are available at https://github.com/cv-stuttgart/CCMR.
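To make the two-step strategy more concrete, the following is a minimal NumPy sketch of cross-covariance attention as defined for XCiT, applied first to context features and then to motion features. It is an illustration under stated assumptions, not the authors' implementation: learned projections, multiple heads and normalization layers are omitted, the names `xca`, `context` and `motion` are hypothetical, and taking queries and keys from the global context with values from the motion features is an assumed reading of "context-guided motion grouping".

```python
import numpy as np

def l2_normalize(x, axis, eps=1e-12):
    """Scale x to unit L2 norm along the given axis."""
    norm = np.sqrt((x * x).sum(axis=axis, keepdims=True))
    return x / np.maximum(norm, eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def xca(q, k, v, tau=1.0):
    """Cross-covariance attention on (N tokens, d channels) arrays.

    The attention map is d x d (channel-to-channel), so its size does
    not grow with the number of tokens N, unlike N x N token attention.
    """
    q_hat = l2_normalize(q, axis=0)  # normalize each channel over tokens
    k_hat = l2_normalize(k, axis=0)
    attn = softmax(k_hat.T @ q_hat / tau, axis=0)  # (d, d), columns sum to 1
    return v @ attn, attn

rng = np.random.default_rng(0)
n_tokens, dim = 64, 8  # e.g. an 8 x 8 feature grid with 8 channels
context = rng.standard_normal((n_tokens, dim))  # hypothetical context features
motion = rng.standard_normal((n_tokens, dim))   # hypothetical motion features

# Step 1: global context via self-attention over the context features.
global_context, _ = xca(context, context, context)

# Step 2 (assumed wiring): global context supplies queries and keys,
# motion features supply the values that get regrouped.
guided_motion, attn = xca(global_context, global_context, motion)
```

Because the attention map has shape `(dim, dim)` rather than `(n_tokens, n_tokens)`, its memory cost stays fixed as the resolution grows, which is what makes this form of attention attractive on the finer scales.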

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Comparison of our method to the ground truth (first row), recent approaches from the literature (second row) and the baselines (third row). Our method offers more structural details in the foreground and background (see boxes).
  • Figure 2: Coarse-to-fine architecture. The flow is first estimated on the coarsest scale, upsampled and used as initialization on the next finer scale. Note that the third scale is omitted for compactness.
  • Figure 3: Improved multi-scale feature consolidation (top) and global context computation (bottom). Outputs shown on the right are used as inputs in the matching module in Figure 2.
  • Figure 4: Visualization of context features. Top to bottom: reference frame, visualization of context $C_4$, $GC_4$, two channels of $GC_4$. Similar to XCiT (Ali et al., 2021), the $L_2$ norm of each feature map is shown as a heat map. Each feature map is normalized individually.
  • Figure 5: Matching block of our CCMR approach for each scale. Motion features are guided based on global context iteratively and used in the flow update computation. The motion encoder and the update block are shared among scales.
  • ...and 4 more figures
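
The coarse-to-fine scheme described in the Figure 2 caption (estimate the flow on the coarsest scale, upsample it, and use it as initialization on the next finer scale) can be sketched as follows. This is a rough illustration only: the factor of 2 between scales, the nearest-neighbour upsampling, the zero initialization, and the `refine` callback standing in for the per-scale matching block are all assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def upsample_flow_2x(flow):
    """Upsample an (H, W, 2) flow field to (2H, 2W, 2).

    Nearest-neighbour repetition is used for simplicity. The vectors are
    multiplied by 2 because a displacement measured in pixels doubles
    when the image resolution doubles.
    """
    up = flow.repeat(2, axis=0).repeat(2, axis=1)
    return 2.0 * up

def coarse_to_fine(refine, coarsest_shape, num_scales=4):
    """Run a hypothetical per-scale refinement from coarse to fine.

    refine(flow, scale) -> flow is a placeholder for the matching block
    at a given scale; it must preserve the shape of its input.
    """
    h, w = coarsest_shape
    flow = np.zeros((h, w, 2))  # start from zero flow on the coarsest scale
    for scale in range(num_scales):
        flow = refine(flow, scale)
        if scale < num_scales - 1:
            # Initialize the next finer scale with the upsampled estimate.
            flow = upsample_flow_2x(flow)
    return flow

# Toy refinement step that adds one pixel of displacement per scale.
flow = coarse_to_fine(lambda f, s: f + 1.0, coarsest_shape=(4, 4))
```

With four scales and a 4 x 4 coarsest grid, the toy run ends on a 32 x 32 grid; each upsampling both enlarges the grid and rescales the displacements accumulated so far.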