Meta-Optimization for Higher Model Generalizability in Single-Image Depth Prediction

Cho-Ying Wu, Yiqi Zhong, Junying Wang, Ulrich Neumann

TL;DR

This work tackles the problem of generalizing monocular depth prediction for indoor scenes to unseen environments. It introduces gradient-based meta-learning that treats each RGB-D pair as a fine-grained task, learning a depth prior $\theta^{prior}$ through a bilevel optimization with a base- and meta-optimizer, followed by conventional supervised training to yield $\theta^*$. The approach enables zero-shot cross-dataset transfer and improves depth accuracy without additional data or pretrained networks, achieving notable gains on cross-dataset protocols and even improving 3D representations for NeRF-style rendering. Overall, the method provides a simple, plug-in meta-initialization that enhances generalization in depth-from-single-image tasks and encourages practical deployment across diverse indoor scenes.

Abstract

Model generalizability to unseen datasets, which reflects in-the-wild robustness, is less studied for indoor single-image depth prediction. We leverage gradient-based meta-learning for higher generalizability in zero-shot cross-dataset inference. Unlike image classification, the most-studied setting in meta-learning, depth consists of pixel-level continuous range values, and the mapping from each image to depth varies widely across environments, so no explicit task boundaries exist. We instead propose a fine-grained task formulation that treats each RGB-D pair as a task in our meta-optimization. We first show that meta-learning on limited data induces a much better prior (up to +29.4%). Used as the initialization for subsequent supervised learning, without any extra data or information, the meta-learned weights consistently outperform baselines trained without the method. Whereas most indoor-depth methods train and test on a single dataset, we propose zero-shot cross-dataset protocols, closely evaluate robustness, and show consistently higher generalizability and accuracy with our meta-initialization. This work at the intersection of depth and meta-learning potentially drives both research streams closer to practical use.
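The two-level prior learning summarized above can be illustrated with a toy sketch. It is not the authors' implementation: the two-parameter linear "depth model", the synthetic tasks, the hyperparameter values, and all function names are assumptions made for illustration. It also assumes that the $L$ base-optimizer steps are taken jointly on the $K$ sampled tasks, which is one possible reading of the method. The meta-update interpolates the meta-parameters toward the adapted weights, in the style of Reptile.

```python
import random

def predict(theta, x):
    """Toy 'depth model': depth = w * intensity + b (a stand-in for a network)."""
    w, b = theta
    return w * x + b

def grad_mse(theta, pixels):
    """Gradient of the mean squared regression loss over (intensity, depth) pixels."""
    gw = gb = 0.0
    for x, d in pixels:
        err = predict(theta, x) - d
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(pixels)
    return gw / n, gb / n

def mse(theta, tasks):
    """Mean squared error of theta over all pixels of all tasks."""
    pixels = [p for task in tasks for p in task]
    return sum((predict(theta, x) - d) ** 2 for x, d in pixels) / len(pixels)

def meta_initialize(tasks, meta_iters=200, K=4, L=3,
                    base_lr=0.1, meta_lr=0.5, seed=0):
    """Return a prior theta learned with Reptile-style interpolation."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)                       # meta-parameters
    for _ in range(meta_iters):
        batch = rng.sample(tasks, K)         # K fine-grained tasks (one per RGB-D pair)
        pixels = [p for task in batch for p in task]
        phi = theta                          # base-optimizer starts from the meta-parameters
        for _ in range(L):                   # L plain SGD steps (assumed base-optimizer)
            gw, gb = grad_mse(phi, pixels)
            phi = (phi[0] - base_lr * gw, phi[1] - base_lr * gb)
        # Meta-optimizer: move theta part of the way toward the adapted weights.
        theta = tuple(t + meta_lr * (p - t) for t, p in zip(theta, phi))
    return theta

def make_toy_tasks(n_tasks=16, n_pixels=32, seed=1):
    """Hypothetical data: each task has its own slightly different mapping."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n_tasks):
        w = 2.0 + rng.uniform(-0.3, 0.3)
        b = 1.0 + rng.uniform(-0.3, 0.3)
        xs = [rng.random() for _ in range(n_pixels)]
        tasks.append([(x, w * x + b) for x in xs])
    return tasks

tasks = make_toy_tasks()
theta_prior = meta_initialize(tasks)
```

In the setting described here, `theta_prior` would then initialize ordinary supervised training that produces the final model; that second stage is a standard training loop and is omitted from the sketch.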

Paper Structure (8 sections, 1 equation, 3 figures)

Figures (3)

  • Figure 1: Geometry structure comparison in 3D point cloud view. We back-project the depth maps predicted from images into textured 3D point clouds to show the geometry. The proposed Meta-Initialization has better domain generalizability, which leads to more accurate depth prediction and hence better 3D structures (zoom in for the best view).
  • Figure 2: Meta-Initialization for learning image-to-depth mappings. The prior learning stage adopts a base-optimizer and a meta-optimizer. Inside each meta-iteration, $K$ fine-grained tasks are sampled and used to minimize the regression loss. The base-optimizer takes $L$ steps to search for weight update directions for these $K$ tasks. The meta-optimizer then follows the explored inner trends to update the meta-parameters in the Reptile style (Nichol et al., 2018). The image-to-depth prior $\theta^{prior}$ is output at the end of this stage and is then used as the initialization for the subsequent supervised learning, which yields the final model $\theta^*$.
  • Figure 3: Fitting to training environments. "var" shows the depth variance in the highlighted regions. We compare how well pure meta-learning (Meta) and direct supervised learning (DSL) fit the training environments on Replica, a dataset with limited scene variety. Meta produces smoother and more precise depth, and resolves depth-irrelevant textures on planar regions more correctly. In contrast, DSL produces irregularities caused by local high-frequency details, especially with ResNet50. See the scene-fitting experiments section for details and the explanation section for the reasoning.
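The back-projection mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes a standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy`, the nested-list data layout, and the function name are hypothetical and are not taken from the paper.

```python
def back_project(depth, colors, fx, fy, cx, cy):
    """Lift a depth map to a colored 3D point list under a pinhole model.

    depth:  H x W nested list of depths along the optical axis
    colors: H x W nested list of (r, g, b) tuples from the input image
    fx, fy: focal lengths in pixels (assumed known)
    cx, cy: principal point in pixels (assumed known)
    """
    points = []
    for v, row in enumerate(depth):          # v indexes image rows
        for u, z in enumerate(row):          # u indexes image columns
            if z <= 0:                       # skip invalid or missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, colors[v][u]))
    return points

# Tiny usage example with made-up values.
depth = [[2.0, 2.0],
         [2.0, 0.0]]                         # the last pixel has no valid depth
colors = [[(255, 0, 0), (0, 255, 0)],
          [(0, 0, 255), (0, 0, 0)]]
cloud = back_project(depth, colors, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Because each 3D coordinate scales with the predicted depth, errors in the depth map translate directly into distorted geometry, which is why point cloud views make differences between depth predictors easy to see.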