Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation

Xiaoxian Zhang, Minghai Shi, Lei Li

Abstract

Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting them to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Unlike conventional fine-tuning, which risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundation model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance, with absolute relative errors of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.
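The idea of amplifying attenuated high-frequency components through a wavelet decomposition can be illustrated with a minimal sketch. It is not the SpecDepth module: it assumes a fixed single-level Haar wavelet instead of a learnable decomposition, operates on a plain 2D list rather than encoder feature maps, and uses hand-picked scalar gains where the paper describes an adaptive mechanism. The function names `haar_dwt2`, `haar_idwt2`, and `rectify` and the `gains` parameter are hypothetical.

```python
def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution subbands: one low-frequency (ll) and
    three high-frequency detail bands (lh, hl, hh).
    """
    h, w = len(x), len(x[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    lh = [[0.0] * (w // 2) for _ in range(h // 2)]
    hl = [[0.0] * (w // 2) for _ in range(h // 2)]
    hh = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2.0
            lh[i // 2][j // 2] = (a - b + c - d) / 2.0  # change across columns
            hl[i // 2][j // 2] = (a + b - c - d) / 2.0  # change across rows
            hh[i // 2][j // 2] = (a - b - c + d) / 2.0  # diagonal change
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution 2D list."""
    h, w = len(ll) * 2, len(ll[0]) * 2
    x = [[0.0] * w for _ in range(h)]
    for i in range(h // 2):
        for j in range(w // 2):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + p + q + r) / 2.0
            x[2 * i][2 * j + 1] = (s - p + q - r) / 2.0
            x[2 * i + 1][2 * j] = (s + p - q - r) / 2.0
            x[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2.0
    return x


def rectify(x, gains=(1.0, 1.5, 1.5, 1.5)):
    """Scale each subband by a gain (assumed fixed here) and reconstruct.

    gains = (ll, lh, hl, hh); values above 1 on the detail bands
    amplify high-frequency content while leaving the low band untouched.
    """
    bands = haar_dwt2(x)
    scaled = [[[g * v for v in row] for row in band]
              for g, band in zip(gains, bands)]
    return haar_idwt2(*scaled)


# A weak vertical edge: doubling the detail gains doubles its contrast.
edge = [[0.0, 1.0], [0.0, 1.0]]
sharpened = rectify(edge, gains=(1.0, 2.0, 2.0, 2.0))
```

With all gains set to 1 the transform is an exact round trip, and a constant input is unaffected by any detail gain because its detail bands are zero. In this sketch, only regions that already carry edge or texture energy are changed.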

Paper Structure (22 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Statistical shift in the frequency domain between natural and colonoscopy images. (a) Visual comparison of Fourier magnitude spectra. Natural images (left) display prominent star-like anisotropic radiation patterns, reflecting the abundance of straight lines and sharp edges in real-world scenes. Colonoscopy images (right) show centrally concentrated, more isotropic diffusion patterns, indicating a lack of directional geometric structures, with high-frequency energy decaying rapidly but influenced by specular highlights and unstructured noise. (b) Scatter plot of power-law spectral slope ($\alpha$) versus linearity of the fit ($R^2$). Natural street scenes (orange clusters) exhibit spectral slopes and linearity that strictly adhere to the characteristic power-law distribution of natural scenes. Conversely, colonoscopy images (blue clusters) significantly deviate from this norm, demonstrating a distinct statistical shift in the frequency domain. This shift is characterized by impulsive high-frequency components derived from mucosal artifacts, which elevate the energy in the spectral tail and result in a slower decay rate. The consistently lower $R^2$ values further indicate that biological tissue lacks the strict scale invariance typical of macroscopic natural scenes.
  • Figure 2: Overview of the proposed SpecDepth framework for the depth estimation of colonoscopy images. It employs a partially frozen DINOv2 encoder for stable feature extraction, followed by the gated wavelet transform to decouple feature representations in the frequency domain and selectively amplify critical edge signals. LN: layer normalization. MHSA: multi-head self-attention. MLP: multi-layer perceptron. BN: batch normalization.
  • Figure 3: Illustration of the wavelet decomposition process used in wavelet transform convolution. Note that 2× upsampling is applied to the frequency-domain components for visual alignment with the spatial domain.
  • Figure 4: Visualization of monocular depth estimation by different methods on the C3VD dataset. C3VD is a real clinical colonoscopy dataset providing high-resolution video sequences with pixel-level ground-truth depth annotations. Compared to the baseline methods, SpecDepth better preserved structural boundaries and produced smoother, more geometrically coherent gradients in texture-sparse mucosal regions.
  • Figure 5: Visualization of monocular depth estimation by different methods on the SimCol3D dataset. EcoDepth was heavily corrupted by structured noise, while Marigold, though smoother, tended to produce blob-shaped depth maps that failed to follow the elongated recession of the lumen. Depth Anything V2 captured the overall scene geometry more faithfully but often folded boundaries and transitional regions. SpecDepth produced the most coherent results, accurately rendering the tubular depth gradient along the lumen axis and maintaining crisp fold structures throughout.
  • ...and 3 more figures
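
The two quantities plotted in Figure 1(b), the power-law spectral slope $\alpha$ and the linearity of the fit $R^2$, can be estimated with an ordinary least-squares line in log-log space. The sketch below is a hedged illustration, not the authors' analysis code: it assumes the convention that power falls off as $f^{-\alpha}$, takes an already computed radially averaged power spectrum as input, and uses the hypothetical function name `fit_power_law`. The two example spectra are synthetic.

```python
import math


def fit_power_law(freqs, power):
    """Fit log(power) = c - alpha * log(freq) by least squares.

    Returns (alpha, r_squared). Assumes strictly positive inputs and
    the convention power ~ freq ** (-alpha).
    """
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in power]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 measures how well a straight line describes the log-log spectrum.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return -slope, 1.0 - ss_res / ss_tot


freqs = list(range(1, 65))

# Synthetic spectrum that follows an exact power law with alpha = 2.
ideal = [f ** -2.0 for f in freqs]
alpha_ideal, r2_ideal = fit_power_law(freqs, ideal)

# Same spectrum with a flat noise floor that lifts the high-frequency tail.
lifted = [f ** -2.0 + 0.01 for f in freqs]
alpha_lifted, r2_lifted = fit_power_law(freqs, lifted)
```

In this synthetic setting, the lifted tail yields both a smaller fitted slope and a lower $R^2$ than the exact power law. This is the direction of the deviation the caption of Figure 1(b) attributes to colonoscopy images.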