
NAIMA: Semantics Aware RGB Guided Depth Super-Resolution

Tayyab Nasir, Daochang Liu, Ajmal Mian

Abstract

Guided depth super-resolution (GDSR) is a multi-modal approach to depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, misleading color and texture cues in RGB images, which do not correspond to true depth discontinuities, often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in the related task of monocular depth estimation. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention to selectively inject global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.
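The core mechanism the abstract describes, injecting pretrained ViT token embeddings into depth features via cross-attention, can be illustrated with a minimal sketch. This is not the authors' implementation: the projection matrices here are random stand-ins for learned weights, the single-head attention and residual injection are simplifying assumptions, and the function name `guided_token_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_token_attention(depth_feats, semantic_tokens, d_k=32, seed=0):
    """Sketch of cross-attention injecting semantic context into depth features.

    depth_feats:     flattened depth feature map, shape (H*W, C_depth) -- queries.
    semantic_tokens: pretrained ViT token embeddings, shape (N, C_vit) -- keys/values.
    In the actual model the projections below would be learned; here they are
    random matrices purely to make the data flow concrete.
    """
    rng = np.random.default_rng(seed)
    n_q, c_q = depth_feats.shape
    n_t, c_t = semantic_tokens.shape
    W_q = rng.standard_normal((c_q, d_k)) / np.sqrt(c_q)  # query projection
    W_k = rng.standard_normal((c_t, d_k)) / np.sqrt(c_t)  # key projection
    W_v = rng.standard_normal((c_t, c_q)) / np.sqrt(c_t)  # value projection
    Q = depth_feats @ W_q
    K = semantic_tokens @ W_k
    V = semantic_tokens @ W_v
    # Each depth location attends over all semantic tokens (global context).
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # shape (H*W, N)
    # Residual injection: depth geometry is preserved, semantics are added.
    return depth_feats + attn @ V
```

In the full architecture this operation would be applied iteratively inside each GTA block, with tokens drawn from different DINOv2 layers; the residual form reflects the paper's stated goal of enriching, rather than replacing, the geometric depth features.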

Paper Structure

This paper contains 19 sections, 11 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Blurred depth discontinuities caused by RGB noise when performing super-resolution without semantic guidance. In contrast, our semantics-aware approach leverages global contextual information to better preserve structural fidelity.
  • Figure 2: Overview of Neural Attention for Implicit Multi-token Alignment (NAIMA) architecture. The semantic encoder is based on DINOv2 and extracts high-level semantic representations from the RGB input. GTA is our proposed Guided Token Attention module, which uses cross-attention to inject relevant semantic features into the depth feature maps, along with integration of spatial RGB information. The Upsampler consists of a series of convolutional, deconvolutional, and residual channel attention layers that progressively reconstruct and spatially upscale the final depth features.
  • Figure 3: Overview of the Guided Token Attention (GTA) module. This module encodes spatial, depth, and semantic features and merges them via cross-attention to produce a semantically enriched depth feature map, while preserving geometric details from both the low-resolution depth and RGB inputs.
  • Figure 4: Comparison of depth maps at 8x scaling factor across multiple evaluation datasets. Bicubic denotes the bicubic-upsampled low-resolution input, while GT represents the ground-truth depth map. The bounding box denotes the corresponding area in the RGB image. It can be observed that NAIMA not only improves depth boundary reconstruction but also adheres to structural details while reconstructing the depth maps.
  • Figure 5: Comparison of error maps for a selected patch from the Middlebury dataset at 8x upscaling. The bounding box indicates the corresponding region in the RGB image.
  • ...and 4 more figures