IGEV++: Iterative Multi-range Geometry Encoding Volumes for Stereo Matching

Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Chunyuan Liao, Xin Yang

TL;DR

IGEV++ tackles stereo matching in ill-posed regions and large disparities by introducing Multi-range Geometry Encoding Volumes (MGEV) that encode coarse geometry for challenging areas while preserving fine-grained details. It combines adaptive patch matching (APM) and selective geometry feature fusion (SGFF) to construct and fuse multi-range, multi-granularity geometry information, which is then iteratively refined by ConvGRUs to update the disparity map. The method achieves state-of-the-art results on Scene Flow across disparity ranges up to 768px and on KITTI, Middlebury, and ETH3D benchmarks, with particularly strong performance in reflective/textureless regions and rapid convergence. A real-time variant, RT-IGEV, demonstrates real-time inference with competitive accuracy, and zero-shot generalization is shown on unseen real-world datasets, highlighting practical impact for real-world 3D perception systems.

Abstract

Stereo matching is a core component in many computer vision and robotics systems. Despite significant advances over the last decade, handling matching ambiguities in ill-posed regions and large disparities remains an open challenge. In this paper, we propose a new deep network architecture, called IGEV++, for stereo matching. The proposed IGEV++ constructs Multi-range Geometry Encoding Volumes (MGEV), which encode coarse-grained geometry information for ill-posed regions and large disparities, while preserving fine-grained geometry information for details and small disparities. To construct MGEV, we introduce an adaptive patch matching module that efficiently and effectively computes matching costs for large disparity ranges and/or ill-posed regions. We further propose a selective geometry feature fusion module to adaptively fuse multi-range and multi-granularity geometry features in MGEV. Then, we input the fused geometry features into ConvGRUs to iteratively update the disparity map. MGEV allows efficient handling of large disparities and ill-posed regions, such as occlusions and textureless regions, and enjoys rapid convergence during iterations. Our IGEV++ achieves the best performance on the Scene Flow test set across all disparity ranges, up to 768px. Our IGEV++ also achieves state-of-the-art accuracy on the Middlebury, ETH3D, KITTI 2012, and KITTI 2015 benchmarks. Specifically, IGEV++ achieves a 3.23% 2-pixel outlier rate (Bad 2.0) on the large disparity benchmark, Middlebury, representing error reductions of 31.9% and 54.8% compared to RAFT-Stereo and GMStereo, respectively. We also present a real-time version of IGEV++ that achieves the best performance among all published real-time methods on the KITTI benchmarks. The code is publicly available at https://github.com/gangweix/IGEV and https://github.com/gangweix/IGEV-plusplus.
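
To make the Bad 2.0 figures above concrete, the following is a minimal sketch of how a 2-pixel outlier rate and a relative error reduction are commonly computed. It is not taken from the authors' code: the function names (`bad_rate`, `relative_error_reduction`, `implied_baseline`), the flat-list input, and the use of `None` for invalid ground truth are all assumptions made for illustration, and the official benchmark evaluation may differ in masking and resolution details.

```python
def bad_rate(pred, gt, threshold=2.0):
    """Percentage of valid pixels whose absolute disparity error exceeds `threshold`.

    `pred` and `gt` are flat sequences of disparities in pixels.
    Ground-truth entries equal to None are treated as invalid and skipped
    (an assumption of this sketch, not a benchmark convention).
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    outliers = sum(1 for e in errors if e > threshold)
    return 100.0 * outliers / len(errors)


def relative_error_reduction(ours, baseline):
    """Relative reduction (in percent) of an error metric versus a baseline."""
    return 100.0 * (baseline - ours) / baseline


def implied_baseline(ours, reduction_percent):
    """Baseline error implied by our error and a stated relative reduction."""
    return ours / (1.0 - reduction_percent / 100.0)


# Toy example: errors are 0.5, 3.0, 0.0 and 5.0 px, so 2 of 4 pixels are outliers.
toy_rate = bad_rate([10.5, 23.0, 30.0, 45.0], [10.0, 20.0, 30.0, 40.0])

# Baselines back-computed from the abstract's numbers (not quoted from a leaderboard).
raft_stereo_implied = implied_baseline(3.23, 31.9)
gmstereo_implied = implied_baseline(3.23, 54.8)
```

Under these definitions, the stated reductions of 31.9% and 54.8% from a 3.23% outlier rate imply baseline Bad 2.0 rates of roughly 4.74% and 7.15%. These values are derived arithmetically from the abstract, not read from the benchmark tables.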

Paper Structure (21 sections, 12 equations, 12 figures, 9 tables)

This paper contains 21 sections, 12 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Left: Comparisons with state-of-the-art stereo methods (PCWNet, GwcNet, GMStereo, RAFT-Stereo) across different disparity ranges on the Scene Flow test set. Our IGEV++ outperforms previously published methods by a large margin across all disparity ranges. Right: Comparisons with state-of-the-art stereo methods (CREStereo, RAFT-Stereo, DLNR, HITNet, GMStereo, CroCo-Stereo) on the Middlebury and KITTI 2015 leaderboards. Our IGEV++ achieves the best performance.
  • Figure 2: Row 1: Visual comparisons with state-of-the-art methods (PCWNet, DLNR, GMStereo) in large disparity regions on the Scene Flow test set. PCWNet is a volume filtering-based method, DLNR is an iterative optimization-based method, and GMStereo is a transformer-based method. They all struggle to handle large disparities in large textureless objects at close range. Row 2: Zero-shot generalization results on Middlebury. Our IGEV++ effectively handles large disparities in textureless regions and also distinguishes subtle details in complex backgrounds.
  • Figure 3: Network architecture of the proposed IGEV++. IGEV++ first builds Multi-range Geometry Encoding Volumes (MGEV) via Adaptive Patch Matching (APM). After 3D aggregation or regularization, MGEV encodes coarse-grained geometry information of the scene for textureless regions and large disparities, and fine-grained geometry information for details and small disparities. Then we regress an initial disparity map from MGEV through soft argmin, which serves as the starting point for ConvGRUs. In each iteration, we index multi-range and multi-granularity geometry features from MGEV, selectively fuse them, and then input them into ConvGRUs to update the disparity field.
  • Figure 4: Comparison with the state-of-the-art transformer-based method GMStereo in ill-posed and large disparity regions on the Scene Flow test set.
  • Figure 5: EPE (Disp<768px) vs. number of iterations at inference. The figure exhibits the prediction results on the Scene Flow test set at different iteration numbers during inference. Our IGEV++ converges faster and reaches a lower convergence point.
  • ...and 7 more figures
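
The soft argmin regression mentioned in the Figure 3 caption can be illustrated for a single pixel as follows. This is a generic sketch of the standard soft argmin operation (a softmax-weighted average over disparity candidates), not the authors' implementation: the function name `soft_argmin`, the cost sign convention (lower cost means better match), and the optional `candidates` argument for non-unit disparity spacing are assumptions. A volume that stores similarity scores instead of costs would use the softmax of the scores directly.

```python
import math


def soft_argmin(costs, candidates=None):
    """Expected disparity for one pixel under softmax(-cost).

    `costs[i]` is the matching cost of disparity candidate i (lower is better).
    `candidates` optionally gives the disparity value of each candidate;
    by default candidate i corresponds to a disparity of i pixels.
    """
    if candidates is None:
        candidates = range(len(costs))
    # Shift by the minimum cost so the largest exponent is 0 (numerical stability).
    lowest = min(costs)
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in zip(candidates, weights)) / total


# A sharply peaked cost profile yields a disparity close to the best candidate.
peaked = soft_argmin([50.0, 50.0, 0.0, 50.0])

# A symmetric profile yields the centre candidate.
symmetric = soft_argmin([1.0, 0.0, 1.0])

# Hypothetical coarser sampling: candidates spaced 4 px apart.
coarse = soft_argmin([50.0, 0.0, 50.0], candidates=[0, 4, 8])
```

Because the output is a weighted average instead of a hard index, it is differentiable and can take sub-pixel values, which is why it is a natural starting point for subsequent iterative refinement.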