MetaFE-DE: Learning Meta Feature Embedding for Depth Estimation from Monocular Endoscopic Images

Dawei Lu, Deqiang Xiao, Danni Ai, Jingfan Fan, Tianyu Fu, Yucong Lin, Hong Song, Xujiong Ye, Lei Zhang, Jian Yang

TL;DR

Depth estimation from monocular endoscopic images is challenged by irregular tissue shapes and varying lighting. The authors introduce MetaFE-DE, a two-stage self-supervised framework that learns MetaFE, a shared latent representation of the physical scene that can be decoded into either RGB or depth images. Stage 1 uses a temporal latent diffusion model with cross normalization to fuse temporal and spatial cues into MetaFE; Stage 2 decodes MetaFE into depth via a brightness-calibrated monocular depth estimation pipeline. Across SCARED, EndoSLAM, and Hamlyn, MetaFE-DE achieves state-of-the-art accuracy and demonstrates strong cross-dataset generalization, with open-source code to follow.

Abstract

Depth estimation from monocular endoscopic images presents significant challenges due to the complexity of endoscopic surgery, such as the irregular shapes of human soft tissues and variations in lighting conditions. Existing methods primarily estimate depth directly from RGB images, and often suffer from limited interpretability and accuracy. Given that RGB and depth images are two views of the same endoscopic surgery scene, in this paper we introduce a novel concept referred to as "meta feature embedding (MetaFE)", in which the physical entities of endoscopic surgery (e.g., tissues and surgical instruments) are represented using shared features that can be decoded into either an RGB or a depth image. With this concept, we propose a two-stage self-supervised learning paradigm for monocular endoscopic depth estimation. In the first stage, we propose a temporal representation learner using diffusion models, which are aligned with the spatial information through cross normalization to construct the MetaFE. In the second stage, self-supervised monocular depth estimation with brightness calibration is applied to decode the meta features into the depth image. Extensive evaluation on diverse endoscopic datasets demonstrates that our approach outperforms state-of-the-art methods in depth estimation, achieving superior accuracy and generalization. The source code will be publicly available.

Paper Structure

This paper contains 35 sections, 18 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: This paper proposes MetaFE, which represents physical entities in the endoscopic surgical scene and provides a comprehensive description of the complex surgical environment. These features can be decoded into either an RGB or a depth image, with the potential to yield more accurate depth estimation.
  • Figure 2: Prior studies [platonic] suggest that text and images jointly represent the same entity in the physical world. This paper, however, categorizes modalities into a non-anthropocentric space and an anthropocentrically defined space based on their susceptibility to human cognition. For example, modalities such as RGB and depth images are unaffected by human cognition and can therefore reflect intrinsic physical properties.
  • Figure 3: The structure of the proposed framework (MetaFE-DE), which consists of two phases: meta feature generation and decoding.
  • Figure 4: By decoding the depth information from MetaFE, our method generates depth images with more accurate details than three related methods.
  • Figure 5: Feature similarity using CKA, with axes representing network layers and cell values indicating similarity. A: Similarity at each layer between the depth and RGB decoders (depth decoder trained with RGB pre-trained weights). B: Similarity at each layer between the depth decoder (trained from scratch) and the RGB decoder. C: Intra-layer similarity within the depth decoder (trained from scratch). D: Intra-layer similarity within the RGB decoder.
  • ...and 5 more figures
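
The layer-by-layer comparison described in the Figure 5 caption can be illustrated with a small sketch of centered kernel alignment (CKA). The caption does not say which CKA variant the authors use, so the code below assumes the common linear form; `linear_cka` and `cka_matrix` are hypothetical helper names, and the activations are random placeholders rather than features from the paper's decoders.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Assumption: this is the linear variant; the paper may use another kernel.
    """
    # Center every feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(feats_a, feats_b) -> np.ndarray:
    """Similarity grid: rows index layers of network A, columns layers of B."""
    return np.array([[linear_cka(a, b) for b in feats_b] for a in feats_a])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder activations: 64 samples, three layers per "decoder".
    depth_feats = [rng.normal(size=(64, d)) for d in (16, 32, 8)]
    rgb_feats = [rng.normal(size=(64, d)) for d in (16, 32, 8)]
    # Passing the same list twice gives an intra-network grid (panels C and D);
    # passing two different lists gives a cross-network grid (panels A and B).
    print(cka_matrix(depth_feats, rgb_feats).round(3))
```

Each cell lies between 0 and 1, and the score is unchanged by rotating or uniformly rescaling either set of features, which is why it is convenient for comparing layers of different widths.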