WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching

Yihan Wang, Jia Deng

Abstract

We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D (BP-0.5), Middlebury (RMSE), and KITTI (all metrics), reducing the zero-shot error by 81% on ETH3D, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at https://github.com/princeton-vl/WAFT-Stereo.

Paper Structure

This paper contains 30 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: WAFT-Stereo achieves strong sim-to-real generalization [eth3d, middlebury, menze2015object, bao2020instereo2k].
  • Figure 2: WAFT-Stereo achieves state-of-the-art performance on the Middlebury [middlebury], KITTI-2015 [menze2015object], and ETH3D [eth3d] public benchmarks. Our best-performing model reduces the zero-shot error on ETH3D by at least 61%, while being $1.8$-$6.7\times$ faster than leading methods [wen2025foundationstereo, min2025s2m2]. Our real-time model can process 540p stereo pairs at 21 FPS, while maintaining competitive performance. 'ZS' denotes zero-shot submissions.
  • Figure 3: WAFT-Stereo consists of three parts: (1) an input encoder that extracts features from images; (2) a classification step that estimates probabilities over preset disparity bins, supervised by a soft-cross-entropy loss; and (3) a recurrent updater that takes backward-warped right-view features as input and regresses disparity updates for $T-1$ steps, supervised by a Mixture-of-Laplace loss [wang2024sea].
  • Figure 4: Full cost volumes compute matching costs for all disparity candidates; partial cost volumes compute costs only in a small window around the current disparity estimate; warping aligns the target feature using the current estimate and concatenates aligned and reference features, without computing the matching cost.
  • Figure 5: Left: combining classification and regression achieves better performance than using either one alone. Right: the classification step provides a rough estimate, which is later refined by regressions.
  • ...and 1 more figure
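
The contrast drawn in the Figure 4 caption, between computing matching costs and simply warping, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`backward_warp_right`, `warping_input`, `full_cost_volume`), the `(C, H, W)` feature layout, linear interpolation with border clamping, the dot-product matching cost, and the rectified-stereo convention that left pixel `x` corresponds to right pixel `x - d` are all assumptions made for illustration.

```python
import numpy as np


def backward_warp_right(right_feat, disparity):
    """Align right-view features to the left view (hypothetical helper).

    right_feat: (C, H, W) feature map of the right image.
    disparity:  (H, W) current disparity estimate for the left view, in pixels.
    Assumed convention: left pixel x matches right pixel x - d.
    Sampling is linear along the row, with coordinates clamped to the border.
    """
    C, H, W = right_feat.shape
    # Horizontal sampling position in the right view for every left pixel.
    xs = np.arange(W, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, W - 1.0)
    x0 = np.floor(xs).astype(np.int64)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = xs - x0                      # weight of the right-hand neighbour
    rows = np.arange(H)[:, None]      # broadcasts against x0 / x1
    f0 = right_feat[:, rows, x0]      # (C, H, W)
    f1 = right_feat[:, rows, x1]      # (C, H, W)
    return (1.0 - w1) * f0 + w1 * f1


def warping_input(left_feat, right_feat, disparity):
    """Concatenate reference features with warped target features.

    No matching cost is computed; the result has shape (2C, H, W).
    """
    warped = backward_warp_right(right_feat, disparity)
    return np.concatenate([left_feat, warped], axis=0)


def full_cost_volume(left_feat, right_feat, max_disp):
    """For contrast: a dot-product cost for every integer disparity candidate.

    Returns (max_disp, H, W); out-of-view positions are left at zero.
    """
    C, H, W = left_feat.shape
    cost = np.zeros((max_disp, H, W))
    for d in range(max_disp):
        cost[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :W - d]).sum(axis=0)
    return cost


# Toy example: the right view is the left view shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.standard_normal((8, 4, 16))
right = np.roll(left, -3, axis=2)
disp = np.full((4, 16), 3.0)

updater_input = warping_input(left, right, disp)   # (16, 4, 16)
volume = full_cost_volume(left, right, max_disp=8)  # (8, 4, 16)
```

In this toy setup, the warped features coincide with the left features wherever the sampling position stays inside the image, and the size of the warping-based input does not depend on the number of disparity candidates, whereas the full cost volume grows linearly with `max_disp`.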