Motif Channel Opened in a White-Box: Stereo Matching via Motif Correlation Graph

Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yong Zhao, C. L. Philip Chen

TL;DR

This work tackles the challenge of preserving geometric detail in learning-based stereo matching by introducing MoCha-V2, which uses a Motif Correlation Graph (MCG) to mine recurrent motif patterns across feature channels in a white-box fashion. By combining MCGA-based motif channels with a wavelet-based multi-frequency fusion and a reconstruction-based refinement (REMP), the method restores edge geometry while maintaining interpretability. The framework constructs a Motif Channel Correlation Volume (MCCV) and refines disparity through an iterative update and a full-resolution penalty, achieving state-of-the-art results on Scene Flow, Middlebury, KITTI, and ETH3D, with strong zero-shot generalization to Driving Stereo. While not fully transparent, MoCha-V2 advances explainable detail-learning in stereo and points to future work toward even more interpretable, safety-critical systems.

Abstract

Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as "motifs", within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through the inverse wavelet transform. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.

Paper Structure

This paper contains 20 sections, 10 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: In the deep learning process, feature channels are extended to higher dimensions. During this process, the geometric details learned in individual feature channels are inevitably lost. We propose that these details can be recovered by identifying recurrent geometric structures across the channels. MoCha-V2 constructs a directed graph, i.e., motif correlation graph (MCG), to capture frequent patterns in feature channels. Specifically, MCG leverages node weights to detect recurring geometric structures within the channels. MCG establishes a robust and stable framework for identifying repetitive geometric structures. Due to its interpretability, MoCha-V2 also demonstrates significant potential for safety-critical real-world applications.
  • Figure 2: Pipeline of our previous work, MoCha-Stereo [mocha]. MoCha-Stereo first constructs the Motif Channel Correlation Volume (MCCV). MCCV is built by projecting the relationship between motif channels and normal channels into the basic group correlation volume. Subsequently, we employ an iterative update operator to calculate a disparity according to the correlation value. Finally, the Reconstruction Error Motif Penalty (REMP) module is applied to penalize the generation of the full-resolution disparity map. In REMP, $LFE$ refers to the Low-frequency Error branch, $LMC$ denotes the Latent Motif Channel branch, and $HFE$ denotes the High-frequency Error branch.
  • Figure 3: Architecture overview of MoCha-V2. We improved the method of obtaining Motif Channels in MoCha-Stereo [mocha] by introducing the Motif Correlation Graph Attention (MCGA).
  • Figure 4: Visual comparisons with state-of-the-art stereo methods [gmstereo, selective] on the Middlebury test set. We present the disparity map and error map with a 1-pixel threshold. In the error map, the black regions indicate areas of misestimation.
  • Figure 5: Visual comparisons with state-of-the-art stereo methods [ucfnet, selective] on the KITTI 2012 test set. In the first row of images, the streetlight presents a matching challenge due to its thin structure. UCFNet fails to capture the presence of the poles, and Selective-IGEV inaccurately estimates the sky above the streetlight as being closer in depth to the streetlight. Only MoCha-V2 provides a more accurate estimation of the streetlight's edges. In the second row, the roofs of the two cars in the image reflect natural light, which affects the accuracy of stereo matching. Both UCFNet and Selective-IGEV are impacted by this reflection and fail to accurately capture the cars' edge textures. Additionally, Selective-IGEV does not estimate the contours of the poles on the wall. In the third row, due to lighting influences, UCFNet and Selective-IGEV misinterpret the buildings in the background as part of the trees in the foreground. In contrast, MoCha-V2 delineates a more accurate edge texture.
  • ...and 5 more figures
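
The Figure 2 caption refers to a "basic group correlation volume" that MCCV builds on. As a rough illustration only, the following NumPy sketch shows one common way a group-wise correlation volume is formed from left and right feature maps. The function name, argument names, tensor layout, and the choice of a per-group mean are assumptions made for this example; they are not taken from the MoCha-Stereo or MoCha-V2 code, and the motif-channel projection itself is not shown.

```python
import numpy as np

def group_correlation_volume(left, right, max_disp, num_groups):
    """Hypothetical sketch of a group-wise correlation volume.

    left, right: feature maps of shape (C, H, W).
    Returns an array of shape (num_groups, max_disp, H, W).
    """
    c, h, w = left.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    per_group = c // num_groups
    # Split the channel axis into (group, channels-per-group).
    lg = left.reshape(num_groups, per_group, h, w)
    rg = right.reshape(num_groups, per_group, h, w)
    volume = np.zeros((num_groups, max_disp, h, w), dtype=float)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (lg * rg).mean(axis=1)
        else:
            # Left pixel x is compared with right pixel x - d;
            # columns x < d have no valid match and stay zero.
            volume[:, d, :, d:] = (lg[..., d:] * rg[..., :-d]).mean(axis=1)
    return volume

# Usage with random features (illustrative shapes only).
rng = np.random.default_rng(0)
left_feat = rng.standard_normal((8, 4, 16))
right_feat = np.roll(left_feat, -2, axis=-1)  # simulate a 2-pixel disparity
vol = group_correlation_volume(left_feat, right_feat, max_disp=4, num_groups=2)
print(vol.shape)  # (2, 4, 4, 16)
```

Grouping the channels keeps one correlation score per group instead of collapsing all channels into a single value, so each candidate disparity retains a small feature vector for later processing.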