Depth-aware Volume Attention for Texture-less Stereo Matching

Tong Zhao; Mingyu Ding; Wei Zhan; Masayoshi Tomizuka; Yintao Wei

TL;DR

This work tackles texture-less stereo matching by proposing DVANet, a lightweight framework that refines a 4D cost/disparity volume through a depth-aware texture hierarchy and a target-aware disparity attention mechanism. A depth estimation branch constructs a depth volume that modulates features via channel attention, guiding the network to preserve texture details at larger depths, while a dedicated disparity attention module focuses aggregation around the target disparity to reduce ambiguity. The authors also introduce Weighted Relative Depth Error (WRDE), a depth-aware metric that emphasizes performance across near, middle, and far depth ranges and unifies evaluation for stereo and depth estimation tasks. Extensive experiments on synthetic and real-world datasets demonstrate superior performance in texture-less scenarios, with competitive results on standard benchmarks and improved depth-wise reliability in outdoor environments. The approach offers a practical, efficient solution for robust 3D perception in challenging outdoor conditions and provides a principled evaluation tool via WRDE.
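To make the idea of focusing aggregation around a target disparity concrete, here is a minimal sketch in plain Python. It is not the DVANet module. It assumes a Gaussian re-weighting of a per-pixel matching distribution, followed by the standard soft-argmax disparity regression. The function names, the `target` input, and the `sigma` parameter are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw matching scores over disparity candidates into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regress_disparity(probs):
    """Soft-argmax: expected disparity under the matching distribution."""
    return sum(d * p for d, p in enumerate(probs))

def focus_on_target(probs, target, sigma=1.0):
    """Assumed re-weighting: damp candidates far from a target disparity."""
    weights = [math.exp(-((d - target) ** 2) / (2 * sigma ** 2))
               for d in range(len(probs))]
    focused = [p * w for p, w in zip(probs, weights)]
    total = sum(focused)
    return [f / total for f in focused]

# Bimodal scores, as might arise on a texture-less surface (made-up values).
scores = [0.0, 0.1, 3.0, 0.2, 0.1, 0.0, 2.8, 0.1]
probs = softmax(scores)

plain = regress_disparity(probs)
focused = regress_disparity(focus_on_target(probs, target=2, sigma=1.0))
print(f"plain: {plain:.2f}, focused: {focused:.2f}")
```

With two competing peaks, the plain expectation lands between them, while the re-weighted distribution stays close to the peak nearest the target. This is the kind of ambiguity that target-aware attention is meant to reduce.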

Abstract

Stereo matching plays a crucial role in 3D perception and scenario understanding. Despite the proliferation of promising methods, addressing texture-less and texture-repetitive conditions remains challenging due to the insufficient availability of rich geometric and semantic information. In this paper, we propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios. Specifically, we introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture. Subsequently, the disparity discrepancy volume undergoes hierarchical filtering through the incorporation of depth-aware hierarchy attention and target-aware disparity attention modules. Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation. Furthermore, we propose a more rigorous evaluation metric that considers depth-wise relative error, providing comprehensive evaluations for universal stereo matching and depth estimation models. We extensively validate the superiority of our proposed methods on public datasets. Results demonstrate that our model achieves state-of-the-art performance, particularly excelling in scenarios with texture-less images. The code is available at https://github.com/ztsrxh/DVANet.
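The depth-wise relative error mentioned above can be illustrated with a small sketch. This is not the paper's WRDE definition. The bin edges (10 m and 30 m), the equal bin weights, and the camera parameters are assumptions made purely for illustration. Only the disparity-to-depth relation, depth = focal length × baseline / disparity, is standard for rectified stereo.

```python
def disparity_to_depth(disparity, focal_px, baseline_m):
    """Standard rectified-stereo relation: depth = f * B / d."""
    return focal_px * baseline_m / disparity

def binned_relative_depth_error(pred_depth, gt_depth,
                                edges=(10.0, 30.0),
                                weights=(1.0, 1.0, 1.0)):
    """Mean relative depth error per near/middle/far bin, plus a weighted mean.

    The bin edges and weights are hypothetical, not taken from the paper.
    """
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for p, g in zip(pred_depth, gt_depth):
        if g <= 0:  # skip invalid ground truth
            continue
        idx = 0 if g < edges[0] else (1 if g < edges[1] else 2)
        sums[idx] += abs(p - g) / g
        counts[idx] += 1
    per_bin = [s / c if c else None for s, c in zip(sums, counts)]
    used = [(w, e) for w, e in zip(weights, per_bin) if e is not None]
    total_w = sum(w for w, _ in used)
    score = sum(w * e for w, e in used) / total_w if total_w > 0 else None
    return per_bin, score

# Assumed camera: 720 px focal length, 0.5 m baseline.
focal_px, baseline_m = 720.0, 0.5
gt_disp = [72.0, 18.0, 9.0]    # ground-truth depths of 5 m, 20 m, 40 m
pred_disp = [71.0, 17.0, 8.0]  # each prediction is off by exactly one pixel

gt_depth = [disparity_to_depth(d, focal_px, baseline_m) for d in gt_disp]
pred_depth = [disparity_to_depth(d, focal_px, baseline_m) for d in pred_disp]
per_bin, score = binned_relative_depth_error(pred_depth, gt_depth)
print(per_bin, score)
```

The same one-pixel disparity error produces a larger relative depth error as depth grows. A disparity-only metric hides this effect, which is why an evaluation that accounts for depth ranges treats far regions differently.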

Paper Structure (15 sections, 12 equations, 6 figures, 5 tables)

Introduction
Related Works
Methods
Weighted Relative Depth Error
Network Architecture
Depth-aware Hierarchy Attention
Target-aware Disparity Attention
Loss Function
Experiments
Datasets and Metrics
Implementation Details
Comparison with State-of-the-art
Effectiveness of WRDE
Ablation Studies
Conclusion

Figures (6)

Figure 1: Our motivation. The perspective effect leads to texture deterioration in natural scenarios. Texture is relatively rich at small depths, while it degenerates at greater distances.
Figure 2: Texture attention. From top to bottom: left image, texture hierarchy attention map, and predicted disparity map. Our DVANet concentrates on the texture-less and weak-texture areas of the images, such as over-exposed regions, white windows, and distant roads.
Figure 3: The architecture of the proposed DVANet. It contains three novel designs: a discrepancy volume adapted to texture-less matching, depth-aware texture hierarchy attention, and target-aware disparity attention. The network is guided to focus on the relative texture hierarchy and thereby achieves more reliable matching in weak-texture areas.
Figure 4: Visualization of relative depth error w.r.t. depth. Results are from the cross-domain generalization evaluation on KITTI 2015. The errors increase with depth, and the models exhibit distinct performance.
Figure 5: Visualization of point clouds converted from disparity maps on RSRD. Our DVANet delivers accurate and stable disparity estimation on the extremely texture-less planar road. Fine local structures are accurately recovered even at far distances.
...and 1 more figure
