Depth-Aware Endoscopic Video Inpainting

Francis Xiatian Zhang, Shuang Chen, Xianghua Xie, Hubert P. H. Shum

TL;DR

This work addresses the challenge of endoscopic video inpainting by preserving 3D spatial details through depth information. It introduces DAEVI, a depth-aware framework with three components: Spatial-Temporal Guided Depth Estimation (STGDE), which estimates depth from visual features; Bi-Modal Paired Channel Fusion (BMPCF), which fuses depth with visual features via paired channel operations; and a Depth Enhanced Discriminator (DED), which enforces RGB-D realism during training. On the HyperKvasir dataset, DAEVI achieves approximately a 2% PSNR gain and a 6% reduction in MSE over state-of-the-art methods and generalizes to SERV-CT, with qualitative results showing improved preservation of microvessels and instrument boundaries. These advances enhance the reliability of inpainted endoscopic content for clinical decision-making and have the potential to improve diagnostic and surgical planning outcomes.

Abstract

Video inpainting fills in corrupted video content with plausible replacements. While recent advances in endoscopic video inpainting have shown potential for enhancing the quality of endoscopic videos, they mainly repair 2D visual information without effectively preserving the 3D spatial details that are crucial for clinical reference. Depth-aware inpainting methods attempt to preserve these details by incorporating depth information. In endoscopic contexts, however, they face three challenges: reliance on pre-acquired depth maps, less effective fusion designs, and neglect of the fidelity of 3D spatial details. To address these challenges, we introduce a novel Depth-aware Endoscopic Video Inpainting (DAEVI) framework. It features a Spatial-Temporal Guided Depth Estimation module for direct depth estimation from visual features, a Bi-Modal Paired Channel Fusion module for effective channel-by-channel fusion of visual and depth information, and a Depth Enhanced Discriminator to assess the fidelity of the RGB-D sequence composed of the inpainted frames and estimated depth images. Experimental evaluations on established benchmarks demonstrate our framework's superiority, achieving a 2% improvement in PSNR and a 6% reduction in MSE compared to state-of-the-art methods. Qualitative analyses further validate its enhanced ability to inpaint fine details, highlighting the benefits of integrating depth information into endoscopic inpainting.
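
The abstract reports results as PSNR and MSE. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard definitions, not the authors' evaluation code. The function names, the flat-list pixel representation, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def mse(reference, restored):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value=255 assumes 8-bit pixels."""
    err = mse(reference, restored)
    if err == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / err)


def psnr_gain_from_mse_reduction(fraction):
    """dB gain implied by reducing MSE by `fraction` on a single image pair."""
    return -10.0 * math.log10(1.0 - fraction)


# Hypothetical pixel values, chosen only to exercise the functions.
reference_pixels = [0, 128, 255, 64]
restored_pixels = [0, 130, 250, 64]
print(mse(reference_pixels, restored_pixels))
print(psnr(reference_pixels, restored_pixels))
print(psnr_gain_from_mse_reduction(0.06))
```

Because PSNR is logarithmic in MSE, a relative MSE reduction on a single image pair maps to a fixed gain in dB. Benchmark figures are typically averaged over many frames or videos, so the reported PSNR and MSE percentages need not convert into one another exactly.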

Paper Structure (11 sections, 10 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Comparison with previous methods by Newson et al. [newson2014video] and Daher et al. [daher2023temporal] on corrupted frames from the HyperKvasir dataset [borgli2020hyperkvasir]. Red boxes highlight significant differences. Reference frames are nearby frames with less corruption. Our inpainted content is not only visually plausible but also contextually realistic.
  • Figure 2: Overview of our framework. First, our Spatial-Temporal Guided Depth Estimation module translates depth information from corrupted frames (see the STGDE section). Second, our Bi-Modal Paired Channel Fusion module fuses visual features with depth features (see the BMPCF section). Third, our Depth Enhanced Discriminator assesses the fidelity of the inpainted RGB-D sequence (see the DED section).
  • Figure 3: Comparison of deep learning-based inpainting performance on the SERV-CT dataset: a) Generalization Capability, and b) Depth Information Preservation.