Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization

Cho-Ying Wu, Yiqi Zhong, Junying Wang, Ulrich Neumann

TL;DR

This work uses gradient-based meta-learning to improve generalizability in zero-shot cross-dataset inference. It also proposes zero-shot cross-dataset protocols and validates the higher generalizability induced by the meta-initialization, which serves as a simple and useful plugin for many existing depth estimation methods.

Abstract

Indoor robots rely on depth to perform tasks such as navigation and obstacle detection, and single-image depth estimation is widely used to assist perception. However, most indoor single-image depth prediction work pays little attention to model generalizability to unseen datasets, which matters for in-the-wild robustness in system deployment. This work uses gradient-based meta-learning to improve generalizability in zero-shot cross-dataset inference. Unlike the most-studied meta-learning setting, image classification with explicit class labels, depth estimation has no explicit task boundaries: depth values are continuous and tied to indoor environments that vary widely in object arrangement and scene composition. We propose a fine-grained task formulation that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that our method, trained on limited data, induces a much better prior (up to a 27.8% improvement in RMSE). Then, fine-tuning from the meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate the higher generalizability induced by our meta-initialization, which serves as a simple and useful plugin for many existing depth estimation methods. This work, at the intersection of depth estimation and meta-learning, may help bring both research areas closer to practical use in robotics and machine perception.
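
To make the "each mini-batch is a task" idea concrete, here is a minimal sketch of a first-order, gradient-based meta-initialization loop. The abstract does not specify the update rule, so this sketch assumes a Reptile-style interpolation as a stand-in; the scalar linear "depth model", the synthetic data, and every name and hyperparameter (`sgd_steps`, `meta_initialize`, `inner_lr`, `meta_lr`, and so on) are hypothetical and chosen only for illustration.

```python
import random


def predict(params, x):
    """Toy stand-in for a depth network: depth = w * x + b."""
    w, b = params
    return w * x + b


def sgd_steps(params, batch, lr, steps):
    """Inner loop: a few SGD steps on ONE mini-batch, i.e. one task."""
    w, b = params
    n = len(batch)
    for _ in range(steps):
        # Gradients of the mean squared error over the mini-batch.
        gw = sum(2 * (w * x + b - d) * x for x, d in batch) / n
        gb = sum(2 * (w * x + b - d) for x, d in batch) / n
        w, b = w - lr * gw, b - lr * gb
    return (w, b)


def meta_initialize(batches, params=(0.0, 0.0), inner_lr=0.05,
                    inner_steps=3, meta_lr=0.5):
    """Outer loop (assumed Reptile-style): after adapting to a task,
    move the shared initialization part of the way to the adapted weights."""
    for batch in batches:
        adapted = sgd_steps(params, batch, inner_lr, inner_steps)
        params = tuple(p + meta_lr * (a - p) for p, a in zip(params, adapted))
    return params


def rmse(params, batch):
    """Root-mean-square error of the toy model on one mini-batch."""
    sq = sum((predict(params, x) - d) ** 2 for x, d in batch)
    return (sq / len(batch)) ** 0.5


def make_batches(num_batches=200, batch_size=8, seed=0):
    """Synthetic (input, depth) pairs drawn from depth = 2x + 1 plus noise."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            batch.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
        batches.append(batch)
    return batches


batches = make_batches()
theta = meta_initialize(batches)  # meta-learned initialization
```

In a real pipeline, `theta` would be the starting weights for a second stage of ordinary supervised fine-tuning, which is the role the abstract assigns to the meta-initialization.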

Paper Structure (5 sections, 1 equation, 2 figures)

Figures (2)

  • Figure 1: Geometry structure comparison in 3D point cloud view. We back-project the predicted depth maps from images into textured 3D point clouds to show the geometry. The proposed Meta-Initialization has better domain generalizability, which leads to more accurate depth prediction and hence better 3D structures (zoom in for the best view).
  • Figure 2: Fitting to training environments. "var" shows the variance of depth values in the highlighted regions. We compare how well first-stage meta-learning (Meta) and direct supervised learning (DSL) fit the training data, using the Replica Dataset, which contains limited scene appearance and depth variation (scene variety). Meta produces smoother and more precise depth, and depth-irrelevant textures on planar regions are resolved more correctly. In contrast, DSL produces irregularities affected by local high-frequency details, especially with ResNet50. See Sec. \ref{experiments:scene-fitting} for details and \ref{explanation} for the explanation.
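
The back-projection mentioned in the Figure 1 caption can be sketched as follows. The caption does not state the camera model, so this sketch assumes a standard pinhole camera; the intrinsics (`fx`, `fy`, `cx`, `cy`), the function name `back_project`, and the tiny depth map are all hypothetical. Texture is omitted here; a textured cloud would also attach each pixel's RGB value to its point.

```python
def back_project(depth, fx, fy, cx, cy):
    """Lift a depth map (rows of metric depths) to 3D camera-frame points.

    Assumes a pinhole camera: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points


# Hypothetical 2x2 depth map with one invalid (zero-depth) pixel.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
points = back_project(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
# points == [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (2.0, 2.0, 4.0)]
```

Because every 3D coordinate scales with the predicted depth, errors in the depth map show up directly as distorted geometry, which is why a point cloud view is a useful qualitative check.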