Generalized Geometry Encoding Volume for Real-time Stereo Matching

Jiaxin Liu, Gangwei Xu, Xianqi Wang, Chengliang Zhang, Xin Yang

TL;DR

The paper tackles the need for fast stereo matching that generalizes to unseen scenes. It introduces Generalized Geometry Encoding Volume (GGEV), which combines texture and monocular depth priors via Selective Channel Fusion and Depth-aware Dynamic Cost Aggregation to produce a robust, lightweight cost volume. A depth-guided iterative refinement stage (GRU-based) yields accurate disparities while maintaining real-time speeds. Experimental results on KITTI 2012/2015 and ETH3D demonstrate state-of-the-art real-time performance with strong zero-shot generalization, outperforming existing fast methods by large margins, including in challenging regions.

Abstract

Real-time stereo matching methods primarily focus on enhancing in-domain performance but often overlook the critical importance of generalization in real-world applications. In contrast, recent stereo foundation models leverage monocular foundation models (MFMs) to improve generalization, but typically suffer from substantial inference latency. To address this trade-off, we propose Generalized Geometry Encoding Volume (GGEV), a novel real-time stereo matching network that achieves strong generalization. We first extract depth-aware features that encode domain-invariant structural priors as guidance for cost aggregation. Subsequently, we introduce a Depth-aware Dynamic Cost Aggregation (DDCA) module that adaptively incorporates these priors into each disparity hypothesis, effectively enhancing fragile matching relationships in unseen scenes. Both steps are lightweight and complementary, leading to the construction of a generalized geometry encoding volume with strong generalization capability. Experimental results demonstrate that our GGEV surpasses all existing real-time methods in zero-shot generalization capability, and achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, and ETH3D benchmarks.
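
The abstract refers to an initial cost volume and to per-disparity hypotheses without defining them. The following is a minimal, generic sketch of a correlation cost volume on a single rectified scanline, followed by a soft-argmax disparity readout. It is not the paper's SCF or DDCA module; the function names `correlation_cost_volume` and `soft_argmax_disparity`, the toy one-hot features, and the temperature value are all assumptions made for illustration.

```python
import math


def correlation_cost_volume(left, right, max_disp):
    """Build volume[d][x]: channel-averaged correlation between left[x] and right[x - d].

    left, right: lists of per-pixel feature vectors along one rectified scanline.
    Positions where x - d falls outside the image keep a score of 0.0.
    (Hypothetical helper, not taken from the paper.)
    """
    width = len(left)
    channels = len(left[0])
    volume = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        for x in range(d, width):
            dot = sum(l * r for l, r in zip(left[x], right[x - d]))
            volume[d][x] = dot / channels
    return volume


def soft_argmax_disparity(volume, x, temperature=1.0):
    """Expected disparity at pixel x under a softmax over matching scores."""
    scores = [volume[d][x] / temperature for d in range(len(volume))]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# Toy scanline: one-hot features, with the left view shifted by 2 pixels.
width = 6
right = [[1.0 if c == x else 0.0 for c in range(width)] for x in range(width)]
left = [right[x - 2] if x >= 2 else [0.0] * width for x in range(width)]

volume = correlation_cost_volume(left, right, max_disp=4)
best = max(range(4), key=lambda d: volume[d][4])
estimate = soft_argmax_disparity(volume, x=4, temperature=0.01)
```

In this toy case the matching score at pixel 4 peaks at the true shift of 2. In unseen domains such raw scores tend to be noisy, which is the fragility that the abstract says the depth-aware aggregation is meant to address.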

Paper Structure

This paper contains 47 sections, 7 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Zero-shot generalization comparison. All models are trained on Scene Flow and tested on KITTI, Middlebury, and ETH3D. GGEV achieves comparable speed to RT-IGEV while offering improved generalization on unseen scenes.
  • Figure 2: Overview of our proposed GGEV. The Selective Channel Fusion (SCF) module integrates texture features with depth features as guidance for cost aggregation. Then, the Depth-aware Dynamic Cost Aggregation (DDCA) module adaptively incorporates depth structural priors to enhance the fragile matching relationships in the initial cost volume, resulting in a generalized geometry encoding volume.
  • Figure 3: Effectiveness of our DDCA in generalization evaluation. The first row shows the initial cost volume features across different disparity hypotheses, which are fragile in unseen scenes and contain many mismatches. In contrast, the second row shows the results after applying our DDCA, which effectively filters out incorrect matches and preserves accurate matching features at their corresponding disparity planes, leading to clearer and more reliable structures.
  • Figure 4: The architecture of the proposed DDCA.
  • Figure 5: Qualitative comparison on ETH3D.
  • ...and 5 more figures