USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network

Joseph Emmanuel DL Dayo, Prospero C. Naval

TL;DR

USAM-Net addresses high-precision stereo depth estimation in driving scenarios by fusing left-right stereo inputs with semantic segmentation maps from a pre-trained Segment Anything Model (SAM). The method uses a U‑Net–style backbone augmented with a self-attention layer, forming a dual-pathway architecture that integrates segmentation cues into disparity prediction. On DrivingStereo, USAM-Net achieves a state-of-the-art Global Difference of $\approx 3.61\%$ and an End-Point Error of $\approx 0.88$. Segmentation and attention provide complementary gains and improve detail in regions with distinctive features. On KITTI 2015, attention helps after fine-tuning, while Middlebury results indicate domain sensitivity. The results demonstrate the value of combining semantic guidance and self-attention for accurate, real-time disparity estimation in driving contexts, and point to future work on broader datasets and cost-efficient segmentation options.

Abstract

The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture that combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures, which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves a Global Difference (GD) of 3.61\% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
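
The input fusion and the EPE metric mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `build_depth_input` and `end_point_error`, the channel ordering (left image, left mask, right image, right mask), the single-channel mask format, and the convention that ground-truth pixels with disparity `0` are invalid are all assumptions made for illustration; the abstract only states that the masks are concatenated with the stereo images.

```python
import numpy as np


def build_depth_input(left, right, left_mask, right_mask):
    """Concatenate stereo images and segmentation masks along the channel axis.

    Hypothetical helper: the channel order and mask format are assumptions.
    left, right: arrays of shape (H, W, 3)
    left_mask, right_mask: arrays of shape (H, W) or (H, W, C)
    """
    parts = []
    for arr in (left, left_mask, right, right_mask):
        arr = np.asarray(arr, dtype=np.float32)
        if arr.ndim == 2:
            # Give 2-D masks an explicit channel axis so they can be stacked.
            arr = arr[..., None]
        parts.append(arr)
    if len({p.shape[:2] for p in parts}) != 1:
        raise ValueError("all inputs must share the same height and width")
    return np.concatenate(parts, axis=-1)


def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over valid ground-truth pixels.

    By assumption, pixels with gt <= 0 carry no ground truth and are skipped
    unless an explicit boolean `valid` mask is supplied.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    return float(np.abs(pred - gt)[valid].mean())
```

With RGB images and single-channel masks, the fused tensor has 3 + 1 + 3 + 1 = 8 channels per pixel, which would then be fed to the depth estimation pathway.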

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Basic Architecture of USAM-Net
  • Figure 2: The Self-Attention Layer
  • Figure 3: Driving Stereo Dataset
  • Figure 4: Comparison with masking the sky during training. Notice the artifacts on the top portion when there is no masking.
  • Figure 5: Comparison of Absolute Relative Difference (ARD) curves showing disparity errors across depth intervals for the three models
  • ...and 5 more figures