MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu Wang, Yongbin Qin, Jia Wu

TL;DR

MoCha-Stereo addresses edge-detail mismatches in learning-based stereo by restoring geometric structure in feature channels through Motif Channel Attention (MCA) and a Motif Channel Correlation Volume (MCCV). It couples MCCV with a Reconstruction Error Motif Penalty (REMP) for full-resolution refinement, yielding state-of-the-art results on KITTI-2015 and KITTI-2012 Reflective, strong zero-shot generalization on Middlebury and ETH3D, and competitive MVS performance via MoCha-MVS. The method demonstrates that exploiting recurring geometric patterns in channel features and reconstruction-error motifs yields sharper edge matching and robust cross-domain performance. These contributions offer a practical path to high-fidelity disparity with efficient iterative refinement, enhancing edge fidelity in real-world stereo and MVS applications.

Abstract

Learning-based stereo matching techniques have made significant progress. However, existing methods inevitably lose geometrical structure information during the feature channel generation process, resulting in edge detail mismatches. In this paper, the Motif Channel Attention Stereo Matching Network (MoCha-Stereo) is designed to address this problem. We provide the Motif Channel Correlation Volume (MCCV) to determine more accurate edge matching costs. MCCV is achieved by projecting motif channels, which capture common geometric structures in feature channels, onto feature maps and cost volumes. In addition, because edge variations in potential feature channels of the reconstruction error map also affect detail matching, we propose the Reconstruction Error Motif Penalty (REMP) module to further refine the full-resolution disparity estimation. REMP integrates the frequency information of typical channel features from the reconstruction error. MoCha-Stereo ranks 1st on the KITTI-2015 and KITTI-2012 Reflective leaderboards. Our structure also shows excellent performance in Multi-View Stereo. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
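To make the notion of a matching cost volume concrete, the following is a minimal sketch of a plain group-wise correlation volume on a single scanline. It is not MCCV: the motif-channel projection is omitted, and the function name `correlation_volume`, the list-based feature layout, and the zero fill for out-of-range positions are assumptions made for illustration only.

```python
def correlation_volume(left, right, max_disp, groups=1):
    """Group-wise correlation along one scanline (illustrative, not MCCV).

    left, right: feature maps given as C channels, each a list of W floats.
    Returns volume[g][d][x], the mean dot product over the channels of
    group g between the left feature at x and the right feature at x - d.
    Positions where x - d falls outside the image stay at 0.0.
    """
    channels, width = len(left), len(left[0])
    assert channels % groups == 0, "channels must split evenly into groups"
    per_group = channels // groups
    volume = [[[0.0] * width for _ in range(max_disp)] for _ in range(groups)]
    for g in range(groups):
        group_channels = range(g * per_group, (g + 1) * per_group)
        for d in range(max_disp):
            for x in range(d, width):
                dot = sum(left[c][x] * right[c][x - d] for c in group_channels)
                volume[g][d][x] = dot / per_group
    return volume


# Toy input: a feature peak at x = 3 on the left and x = 1 on the right,
# so the correlation at x = 3 should be largest for disparity 2.
left = [[0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 2.0, 0.0]]
right = [[0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0, 0.0]]
volume = correlation_volume(left, right, max_disp=3, groups=2)
best_disp = max(range(3), key=lambda d: volume[0][d][3])
```

In this sketch a higher correlation means a better match, so the disparity at a pixel is read off as the index with the largest value along the disparity axis.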

Paper Structure (17 sections, 10 equations, 6 figures, 7 tables)

This paper contains 17 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (a) Comparison with SOTA methods [lipson2021raft, ucfnet, li2022practical, igev2023xu] on the KITTI 2012 [kitti2012] and KITTI 2015 [kitti2015] leaderboards (lower is better). (b) Performance on the Scene Flow test set [sceneflow] compared with IGEV-Stereo [igev2023xu] and DLNR [2023DLNR] as the number of iterations changes (lower EPE is better).
  • Figure 2: Overview of MoCha-Stereo. MoCha-Stereo first constructs the Motif Channel Correlation Volume (MCCV) by projecting the relationship between motif channels and normal channels into the basic group correlation volume. Based on this cost volume, the disparity map is then built iteratively. Finally, the Reconstruction Error Motif Penalty (REMP) module is applied to penalize the generation of the full-resolution disparity map. In REMP, $LFE$ refers to the Low-frequency Error branch, $LMC$ denotes the Latent Motif Channel branch, and $HFE$ denotes the High-frequency Error branch. The star symbol marks our primary innovations.
  • Figure 3: REMP module for full-resolution refinement. The upper branch (LFE) obtains low-frequency information through pooling, the lower branch (HFE) retains the original high-resolution image as high-frequency detail information, and the middle branch (LMC) learns motif features through a CNN.
  • Figure 4: Visualisation on the KITTI dataset. We compare with existing SOTA methods [lipson2021raft, Lac2022Liu, igev2023xu]. In the first scene, our method accurately captures the edges of the left car and the right roof. In the second scene, it avoids confusing the positions of the two cars on the left and achieves a complete match of the edges of the right car doors. Our method evidently excels at matching edge details.
  • Figure 5: Visualisation on the Middlebury dataset. All results presented in this section demonstrate zero-shot generalization on the Scene Flow dataset. The odd-numbered columns show the original images; the even-numbered columns show zoomed-in details.
  • ...and 1 more figures
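The Figure 3 caption describes REMP as splitting reconstruction-error information into a pooled low-frequency branch (LFE) and a full-resolution high-frequency branch (HFE). The sketch below illustrates only that split on a toy error map. The absolute-difference error, the integer-disparity warp, the 2 x 2 average pooling, and all function names are assumptions for illustration; the learned LMC branch and the actual REMP fusion are not modelled.

```python
def reconstruction_error(left, right, disparity):
    """Absolute difference between the left image and the right image
    warped to the left view with an integer disparity map.

    Pixels whose match x - d falls outside the image get error 0.0
    (an arbitrary choice made for this sketch).
    """
    height, width = len(left), len(left[0])
    error = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src = x - disparity[y][x]
            if 0 <= src < width:
                error[y][x] = abs(left[y][x] - right[y][src])
    return error


def avg_pool_2d(image, k):
    """Non-overlapping k x k average pooling (H and W divisible by k)."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
            for x in range(0, width, k)
        ]
        for y in range(0, height, k)
    ]


def split_frequencies(error, k=2):
    """Return (low, high): pooled error as the low-frequency input and the
    unmodified full-resolution error as the high-frequency input."""
    return avg_pool_2d(error, k), error


# Toy example: the right image is the left image shifted by one pixel.
left = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
right = [[2.0, 3.0, 4.0, 0.0], [6.0, 7.0, 8.0, 0.0]]
zero_disp = [[0] * 4 for _ in range(2)]
true_disp = [[1] * 4 for _ in range(2)]

bad_error = reconstruction_error(left, right, zero_disp)
good_error = reconstruction_error(left, right, true_disp)
low, high = split_frequencies(bad_error, k=2)
```

With the correct disparity the error map is zero everywhere, while a wrong disparity leaves residuals that the pooled and full-resolution views expose at different scales.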