Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Jiawei Yao; Tong Wu; Xiaofeng Zhang

TL;DR

This work uses sparse-pixel visualization to compare how Transformer and CNN architectures approach monocular depth estimation, and finds that Transformers exploit global context but struggle with depth gradient continuity. It introduces the Depth Gradient Refinement (DGR) module, which combines higher-order depth derivatives with attention-based fusion to improve gradient continuity, and the Optimal Transport Depth Loss (OTDL), which preserves depth distributions during training. Integrated with state-of-the-art Transformer models, DGR and OTDL improve performance on NYU-Depth-V2 and KITTI, with competitive edge preservation and without added computational cost. The study offers both practical improvements for depth estimation and interpretability insights into the depth cues Transformers rely on, which may guide future methods in this domain.
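To make the idea of higher-order derivatives followed by fusion and recalibration more concrete, here is a minimal NumPy sketch. It is not the authors' DGR module: the function names `spatial_derivatives` and `refine` are hypothetical, the derivatives are plain finite differences, and the recalibration step is an assumed parameter-free sigmoid gate on channel means standing in for whatever learned attention the paper uses.

```python
import numpy as np

def spatial_derivatives(feat):
    """First- and second-order finite differences of a (C, H, W) feature map."""
    dy, dx = np.gradient(feat, axis=(1, 2))  # first order along H, then W
    dyy = np.gradient(dy, axis=1)            # second order along H
    dxx = np.gradient(dx, axis=2)            # second order along W
    return dx, dy, dxx, dyy

def refine(feat):
    """Concatenate features with their derivatives, then gate each channel.

    Assumption: the gate is a sigmoid of the per-channel spatial mean,
    a stand-in for a learned recalibration layer.
    """
    dx, dy, dxx, dyy = spatial_derivatives(feat)
    fused = np.concatenate([feat, dx, dy, dxx, dyy], axis=0)  # (5C, H, W)
    pooled = fused.mean(axis=(1, 2))                          # (5C,)
    gate = 1.0 / (1.0 + np.exp(-pooled))                      # sigmoid
    return fused * gate[:, None, None]

# Toy input: one channel whose value rises linearly along the width axis.
ramp = np.tile(2.0 * np.arange(6.0), (1, 4, 1))  # shape (1, 4, 6)
refined = refine(ramp)
```

On a linear ramp the first derivative along the width is constant and the second derivatives vanish, so the sketch makes it easy to see which channels carry gradient information into the fused tensor.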

Abstract

Monocular depth estimation is an ongoing challenge in computer vision. Recent Transformer models have demonstrated notable advantages over conventional CNNs in this area. However, it is still poorly understood how these models prioritize different regions of a 2D image and how those regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we use a sparse-pixel approach to contrast the two. Our findings suggest that while Transformers excel at handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further improve Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module, which refines depth estimation through high-order differentiation, feature fusion, and recalibration. Additionally, we draw on optimal transport theory: we treat depth maps as spatial probability distributions and use the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models integrated with the plug-and-play DGR module and the proposed loss function achieve better performance without increasing complexity or computational cost on both the outdoor KITTI and indoor NYU-Depth-V2 datasets. This research not only offers fresh insights into the distinctions between Transformers and CNNs in depth estimation but also paves the way for novel depth estimation methodologies.
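The optimal transport idea in the abstract can be illustrated with a small sketch. This is not the paper's loss: it assumes strictly positive depth maps normalized to sum to one, a squared Euclidean cost between normalized pixel coordinates, and an entropy-regularized Sinkhorn approximation of the transport distance. The function name `sinkhorn_distance` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np

def sinkhorn_distance(pred, target, eps=0.05, n_iters=200):
    """Approximate OT cost between two positive (H, W) depth maps.

    Assumptions: each map is normalized into a probability distribution
    over pixels, and the ground cost is the squared distance between
    pixel coordinates scaled to [0, 1].
    """
    h, w = pred.shape
    a = pred.ravel() / pred.sum()
    b = target.ravel() / target.sum()

    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack(
        [ys.ravel() / max(h - 1, 1), xs.ravel() / max(w - 1, 1)], axis=1
    )
    cost = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)

    kernel = np.exp(-cost / eps)       # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):           # alternate marginal scalings
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)

    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())

# Toy maps: the same depth peak placed in opposite corners.
near = np.ones((4, 4)); near[0, 0] = 10.0
far = np.ones((4, 4)); far[3, 3] = 10.0
d_same = sinkhorn_distance(near, near)
d_far = sinkhorn_distance(near, far)
```

Because of the entropy term, `d_same` is small but not exactly zero; the point of the sketch is that moving depth mass across the image raises the cost, which a pixel-wise loss does not capture in the same way. The dense cost matrix grows with the square of the pixel count, so this form is only practical on small grids.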

Paper Structure (17 sections, 23 equations, 8 figures, 4 tables)

Introduction
Related Work
CNNs and Transformers for Monocular Depth Estimation
Visualization in Deep Learning Models
Methodology
Visualization of Monocular Depth Estimation
Depth Gradient Refinement Module
Optimal Transport Depth Loss
Experiments
Datasets
Models
Transformer Estimation Depth from Sparse Pixels
Comparison to previous state-of-the-art competitors
NYU-Depth-V2 Results
KITTI Results
...and 2 more sections

Figures (8)

Figure 1: Visualization comparison of depth estimation using Transformers and CNNs. From left to right: RGB, depth prediction using a Transformer encoder, and depth prediction using CNN. Both results are obtained under identical data processing and loss conditions. Depth maps estimated by the Transformer method exhibit clearer scene structures than those by the CNN method, while CNNs provide smoother depth estimations at object boundaries.
Figure 2: Network architecture for visualizing depth estimation differences between CNN and Transformer. This figure illustrates a dual-pathway architecture for depth estimation from RGB input. It includes two parallel branches with model-mask networks ($G_C$ and $G_T$) that generate predictive masks from the input image. The masks are then multiplied element-wise with the RGB input to isolate features of interest. The resulting representations are fed into pre-trained, weight-fixed depth prediction networks (“CNN-depth” and “Transformer-depth”), which independently predict depth maps (Predict-depth(C) and Predict-depth(T)). The model-mask networks are trained to mask out regions that are not of interest to the depth prediction models, improving the alignment of the predicted depth maps with the ground truth.
Figure 3: Visualization of differences. From left to right: RGB, image gradient, masks predicted by Transformer and CNN, and depth predictions post-sparsification.
Figure 4: Schematic of the Depth Gradient Refinement (DGR) module. Features processed by the transformer encoder serve as inputs to the DGR module, subsequently undergoing higher-order derivative computation, feature concatenation, and feature recalibration.
Figure 5: Performance of the two networks at different sparsity levels. A larger RMSE indicates poorer performance.
...and 3 more figures
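
The masking step described in the Figure 2 caption and the sparsity/RMSE comparison in Figure 5 can be sketched as follows. This is only an illustration under stated assumptions: `apply_mask`, `sparsity`, `rmse`, and `toy_depth_net` are hypothetical names, the mask is hand-made rather than predicted by $G_C$ or $G_T$, and the toy predictor (a channel mean) merely stands in for a frozen depth network.

```python
import numpy as np

def apply_mask(rgb, mask):
    """Element-wise product of an (H, W, 3) image and an (H, W) mask in [0, 1]."""
    return rgb * mask[..., None]

def sparsity(mask, threshold=0.5):
    """Fraction of pixels the mask suppresses (values below the threshold)."""
    return float((mask < threshold).mean())

def rmse(pred, gt):
    """Root mean squared error between two depth maps."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def toy_depth_net(rgb):
    """Hypothetical stand-in for a frozen depth predictor: mean over channels."""
    return rgb.mean(axis=-1)

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))
gt_depth = toy_depth_net(rgb)           # pretend the unmasked output is ground truth

mask = np.ones((8, 8))
mask[:, 4:] = 0.0                       # suppress the right half of the image
masked_pred = toy_depth_net(apply_mask(rgb, mask))

level = sparsity(mask)                  # share of pixels removed
error = rmse(masked_pred, gt_depth)     # error introduced by sparsification
```

Sweeping the mask over several sparsity levels and recording the error at each one would give a curve of the kind Figure 5 reports for the two networks.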
