A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Xiaobao Wei, Jiawei Liu, Dongbo Yang, Junda Cheng, Changyong Shu, Wei Wang

TL;DR

The paper tackles the inconsistent convergence of high- and low-frequency content in iterative stereo methods such as RAFT-Stereo. It introduces Wavelet-Stereo, which uses the Haar discrete wavelet transform to explicitly decompose input images into high- and low-frequency components, extracts multi-scale features from each component separately, and applies an iterative High-frequency Preservation Update (HPU) operator that preserves edges while refining textures. The HPU comprises an iterative-based frequency adapter with low- and high-frequency selection attention modules and a high-frequency preservation LSTM that conditions hidden-state updates on high-frequency priors, so that updates across iterations are frequency-aware. Empirically, the approach achieves state-of-the-art results on the KITTI 2012/2015 and Scene Flow benchmarks, performs well in both high- and low-frequency regions, and offers plug-and-play components for integration into other iterative stereo methods; a real-time variant is proposed as future work.

Abstract

We find that the EPE evaluation metrics of RAFT-Stereo converge inconsistently in the low- and high-frequency regions, resulting in high-frequency degradation (e.g., at edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that they optimize all frequency components together, without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving this frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high- and low-frequency components using the discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter, which provides adaptively refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high- and low-frequency components separately, our framework can simultaneously refine high-frequency information at edges and low-frequency information in smooth regions, which makes it especially suitable for challenging scenes with fine details and distant textures. Extensive experiments demonstrate that Wavelet-Stereo outperforms state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our framework (https://github.com/SIA-IDE/Wavelet-Stereo).
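The decomposition step described in the abstract can be illustrated with a single-level 2D Haar transform. The following is a minimal NumPy sketch, not the authors' implementation: the function names `haar_dwt2` and `haar_idwt2`, the orthonormal 1/2 scaling, and the sub-band names are assumptions made for illustration, and sub-band naming conventions differ between libraries.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns the low-frequency band and a tuple of three high-frequency bands,
    each of shape (H/2, W/2). Names are illustrative, not from the paper.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")

    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right

    low = (a + b + c + d) / 2.0        # local average: smooth content
    row_diff = (a + b - c - d) / 2.0   # top row minus bottom row
    col_diff = (a - b + c - d) / 2.0   # left column minus right column
    diag = (a - b - c + d) / 2.0       # diagonal detail
    return low, (row_diff, col_diff, diag)


def haar_idwt2(low, highs):
    """Inverse of haar_dwt2: rebuilds the (H, W) array from its four bands."""
    row_diff, col_diff, diag = highs
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (low + row_diff + col_diff + diag) / 2.0
    out[0::2, 1::2] = (low + row_diff - col_diff - diag) / 2.0
    out[1::2, 0::2] = (low - row_diff + col_diff - diag) / 2.0
    out[1::2, 1::2] = (low - row_diff - col_diff + diag) / 2.0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 8))
    low, highs = haar_dwt2(img)
    # The transform is invertible, so no image information is lost by the split.
    print(np.allclose(haar_idwt2(low, highs), img))
```

In a pipeline like the one described, the low band and the three detail bands would be routed to separate feature extractors; how the paper combines or stacks the detail bands is not stated in this summary.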

Paper Structure

This paper contains 13 sections, 9 equations, 13 figures, 8 tables, 2 algorithms.

Figures (13)

  • Figure 1: EPE performance evaluation in high- and low-frequency regions for some challenging scenes on the ETH3D dataset [30]. (a) Traditional iterative-based methods [52] process all frequency components uniformly, resulting in inconsistent convergence in different frequency regions. (b) We design frequency-specific feature extraction and processing modules to achieve overall optimization for different frequency components.
  • Figure 2: Visual comparison on KITTI. All models are trained on Scene Flow and tested directly on KITTI [28, 29]. Wavelet-MonSter outperforms MonSter in challenging areas with high-frequency details and fine structures.
  • Figure 3: Overview of Wavelet-RAFT. Wavelet-RAFT employs a dual-branch architecture comprising (1) a dedicated feature extraction branch for capturing high-frequency texture features $E_h$, and (2) an update branch that progressively refines structural information. The aggregated high-frequency features $F_h$ serve as guidance information injected into the High-frequency Preservation Update (HPU) operator to update the hidden states during each iteration.
  • Figure 4: The framework of the proposed high-frequency feature extractor, consisting of a U-shaped network and a series of convolution blocks, which captures high-frequency features through multi-scale feature aggregation and skip connections.
  • Figure 5: (a) The iterative update process of the hidden states $F_l$, guided by the aggregated high-frequency features $F_h$. (b) The proposed high-frequency preservation update operator, which fine-tunes the high-frequency features in the iterative-based frequency adapter and updates the hidden states with the high-frequency preservation LSTM. (c) The low-frequency selection attention module adaptively integrates low-frequency contextual information to enhance high-frequency features. (d) The high-frequency selection attention module injects high-frequency attention maps to enrich low-frequency features. (e) Our multi-level update structure, which updates hidden states from 1/16 to 1/4 resolution.
  • ...and 8 more figures
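
The per-region EPE evaluation mentioned in the Figure 1 caption can be sketched as follows. This summary does not say how the paper delineates high- and low-frequency regions, so the split used here (thresholding the image gradient magnitude at a quantile) is purely an assumption for illustration, as are the function name `frequency_region_epe` and its `quantile` parameter.

```python
import numpy as np


def frequency_region_epe(pred, gt, image, quantile=0.8):
    """Mean absolute disparity error (EPE) in high- and low-frequency regions.

    ASSUMPTION: a pixel counts as "high-frequency" when its gradient magnitude
    in the reference image exceeds the given quantile. The paper's actual
    region definition is not given in this summary.
    Returns (epe_high, epe_low); a value is NaN if its region is empty.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    image = np.asarray(image, dtype=np.float64)

    # Finite-difference gradients along rows (axis 0) and columns (axis 1).
    grad_rows, grad_cols = np.gradient(image)
    magnitude = np.hypot(grad_rows, grad_cols)
    high = magnitude > np.quantile(magnitude, quantile)

    error = np.abs(pred - gt)

    def mean_or_nan(mask):
        return float(error[mask].mean()) if mask.any() else float("nan")

    return mean_or_nan(high), mean_or_nan(~high)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))
    gt = rng.random((32, 32)) * 64.0
    pred = gt + rng.normal(scale=0.5, size=gt.shape)
    epe_high, epe_low = frequency_region_epe(pred, gt, image)
    print(f"EPE high-frequency: {epe_high:.3f}, low-frequency: {epe_low:.3f}")
```

Calling such a function on the disparity produced after each update iteration would yield the two per-region convergence curves that the caption contrasts.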