FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement

Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney

Abstract

We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt uses a hierarchical transformer that captures extensive global context, enabling it to model long-range correspondences. To overcome the limitations of localized matching, we formulate flow initialization as an optimal transport problem. This formulation yields a robust initial flow field along with explicitly derived occlusion and confidence maps. These cues are then integrated into a guided refinement stage, where the network propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments on the Sintel, KITTI, Spring, and LayeredFlow datasets validate our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks and establishes new state-of-the-art zero-shot cross-dataset generalization on Sintel, Spring, and LayeredFlow.
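
For orientation, the optimal transport formulation mentioned in the abstract can be read against the generic entropy-regularized transport problem below. This is the textbook form, not necessarily the exact objective used in the paper. The symbols are assumptions introduced here: $C$ is a cost matrix between the $N$ pixels of the first frame and the $M$ pixels of the second (for instance the negated feature correlation), $a$ and $b$ are the prescribed marginals, and $\varepsilon$ is the regularization strength.

```latex
\begin{aligned}
P^{\star} &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P), \\
U(a,b) &= \bigl\{ P \in \mathbb{R}_{\ge 0}^{N \times M} : P\mathbf{1}_M = a,\; P^{\top}\mathbf{1}_N = b \bigr\}, \\
H(P) &= -\sum_{i,j} P_{ij}\,(\log P_{ij} - 1).
\end{aligned}
```

The minimizer has the form $P^{\star} = \mathrm{diag}(u)\,\exp(-C/\varepsilon)\,\mathrm{diag}(v)$ for positive scaling vectors $u$ and $v$, which is why it can be computed with alternating row and column normalizations (Sinkhorn iterations).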

Paper Structure

This paper contains 14 sections, 11 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Qualitative comparison between WAFT [wang2025waft] and our FlowIt model. Both models are fine-tuned on FlyingThings and evaluated zero-shot on the Sintel Final training set. FlowIt generalizes much better to this unseen domain.
  • Figure 2: Architecture Overview. FlowIt extracts multi-scale features from images using a CNN encoder followed by a Feature Pyramid Network (FPN). These features are processed with one or more Multi-Resolution Transformer (MRT) blocks. A 4D correlation volume is constructed using the $\frac{1}{4}$ resolution features, and optimal transport is applied to produce a 4D probability map. Initial flow, occlusion, and confidence maps are derived using the probability map. These predictions are refined through a global refinement step, followed by three local refinement iterations to obtain the final outputs.
  • Figure 3: Correlation and probability visualizations for a single pixel. (a) and (b) show the correlation values in the pixel's neighborhood for SEA-RAFT and FlowIt, respectively. (c) shows the corresponding probability map obtained by applying optimal transport to FlowIt's correlation volume.
  • Figure 4: Qualitative Results on the KITTI [kitti] Test Set. Visualizations and error metrics are obtained directly from the official evaluation server. For WAFT, we report results using the DAv2-a2 variant, which achieves the best performance among WAFT models on the KITTI benchmark.
  • Figure 5: Qualitative Results on the Sintel [sintel] Training Set. From left to right: first frame, flow by WAFT-DAv2-a2 and FlowIt (XL), ground-truth flow.
  • ...and 8 more figures
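
The initialization path described in the Figure 2 caption (correlation volume, optimal transport, probability map, initial flow and confidence) can be illustrated with a minimal NumPy sketch. It is a simplified stand-in, not the authors' implementation. The function names, the uniform marginals, the `eps` and `n_iters` values, the soft-argmax flow readout, and the use of the peak probability as a confidence proxy are all assumptions made for illustration. The occlusion map and both refinement stages are omitted.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis, keeping dims."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def correlation_volume(f1, f2):
    """All-pairs dot-product correlation between two (H, W, C) feature maps.

    Returns an (H*W, H*W) matrix, i.e. the 4D volume flattened to 2D.
    The 1/sqrt(C) scaling is an assumption borrowed from attention-style matching.
    """
    h, w, c = f1.shape
    return f1.reshape(h * w, c) @ f2.reshape(h * w, c).T / np.sqrt(c)


def sinkhorn(scores, n_iters=20, eps=0.1):
    """Log-domain Sinkhorn iterations with uniform marginals (assumed).

    Each iteration normalizes columns, then rows, so every row of the
    returned matrix sums to 1 and can be read as a matching distribution.
    """
    log_p = scores / eps
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=0)  # column normalization
        log_p = log_p - logsumexp(log_p, axis=1)  # row normalization
    return np.exp(log_p)


def flow_and_confidence(prob, h, w):
    """Soft-argmax flow and a simple confidence proxy from an (H*W, H*W) map."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(np.float64)
    matched = prob @ grid                      # expected (x, y) in frame 2
    flow = (matched - grid).reshape(h, w, 2)   # displacement per source pixel
    confidence = prob.max(axis=1).reshape(h, w)  # hypothetical proxy: peak probability
    return flow, confidence


# Toy usage: the second feature map is the first shifted right by one pixel.
rng = np.random.default_rng(0)
feat1 = rng.standard_normal((4, 5, 64))
feat2 = np.roll(feat1, shift=1, axis=1)
prob = sinkhorn(correlation_volume(feat1, feat2))
flow, confidence = flow_and_confidence(prob, 4, 5)
```

In the toy example, the horizontal flow is close to one pixel everywhere except the last column, where `np.roll` wraps the features around to the left edge. The dense (H*W, H*W) matrix is only practical at low resolution, which is consistent with the caption building the volume from the $\frac{1}{4}$ resolution features.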