PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Ryutaro Miya; Kazuyoshi Fushinobu; Tatsuya Kawaguchi

Abstract

We propose PureCLIP-Depth, a prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor (NYU Depth V2) and outdoor (KITTI) datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth

Paper Structure (22 sections, 15 equations, 4 figures, 5 tables)

Introduction
Method
Input: Patch-wise RGB Image
Output: Depth Map
Learnable Depth Table
Rotating Vectors in CLIP Embedding Space
Alternating Optimization
Loss Functions
InfoNCE Loss
Alignment Loss
RMSE Loss
Experiment
Datasets
Evaluation
Training
...and 7 more sections

Figures (4)

Figure 1: Overview of the proposed architecture. Aside from the CLIP text and image encoders, all operations are performed strictly within the conceptual CLIP embedding space.
Figure 2: Analysis of depth estimation performance across different distance ranges. (a) Joint distributions of predicted versus ground-truth depth for the NYU Depth V2 and KITTI datasets. (b) Empirical uncertainty, calculated as the standard deviation ($\sigma$), with respect to the ground-truth depth. (c) Histograms illustrating the distribution of ground-truth depth (total sample count) in each dataset. All plots are generated within the ranges of 1.0 to 9.7 m for NYU Depth V2 and 1.0 to 23.0 m for KITTI; the initial bin for NYU (0.33 m) and far-range bins for KITTI (25.0, 27.0, and 29.0 m) are trimmed due to the absence of sufficient ground-truth samples in those intervals.
Figure 3: Qualitative visualization on the NYU Depth V2 test set. Compared to Auty & Mikolajczyk (2023), our method shows significant improvements in depth quality for (a, b) chair shapes, (c) depth variations across the bed, (d) regions with missing values in the library aisle, (e) missing values in the shop background, and (f) the central desk and the entrance to the back room.
Figure 4: Qualitative visualization on the KITTI test set. Compared to Auty & Mikolajczyk (2023), our method more accurately reflects depth for (a) the shapes of street trees, (b) cyclists, (c) parked cars, and (d) pedestrians in the shade.
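
The caption of Figure 2(b) describes an empirical uncertainty computed as a standard deviation with respect to ground-truth depth. A minimal sketch of one way such a curve could be computed is given below. It is not the authors' code: the function name binned_std, the bin edges, and the choice to take the standard deviation of the predictions (rather than of the errors) within each ground-truth bin are all assumptions made for illustration.

```python
import numpy as np

def binned_std(gt, pred, edges):
    """Std of predicted depth inside each ground-truth depth bin (hypothetical helper).

    gt, pred : arrays of ground-truth and predicted depth in metres, same shape.
    edges    : increasing bin edges in metres; bin b covers [edges[b], edges[b+1]).
    Returns (stds, counts). Bins with fewer than 2 samples get NaN.
    """
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1

    # np.digitize gives 1..n_bins for in-range values; shift to 0-based indices.
    # Out-of-range samples map to -1 or n_bins and are never selected below.
    idx = np.digitize(gt, edges) - 1

    stds = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        sel = pred[idx == b]
        counts[b] = sel.size
        if sel.size >= 2:
            stds[b] = sel.std()  # population standard deviation (ddof=0)
    return stds, counts

# Toy usage with made-up numbers, not data from the paper.
stds, counts = binned_std(
    gt=[1.1, 1.2, 1.3, 2.5, 9.0],
    pred=[1.0, 2.0, 3.0, 2.4, 9.0],
    edges=[1.0, 2.0, 3.0],
)
```

The per-bin counts returned alongside the standard deviations correspond to the kind of histogram shown in panel (c), and they make it easy to drop sparsely populated bins, as the caption describes for the trimmed NYU and KITTI intervals.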
