TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning

Rui Wang, Ying Zhou, Hao Wang, Wenwei Zhang, Qiang Li, Zhiwei Wang

TL;DR

This work tackles the challenge of dense disparity supervision for stereo video in MIS by introducing TiS-TSL, a time-switchable model that unifies image and video inference through three GRU-based modes: IP, FVP, and BVP. It trains via a two-stage pipeline: Image-to-Video (I2V), which initializes temporal modeling from sparse image labels, and Video-to-Video (V2V), which enforces bidirectional temporal consistency using a Spatio-Temporal Confidence Filtering Mechanism (ST-CFM). The approach yields temporally coherent disparity maps with improvements in TEPE and EPE on SCARED and Hamlyn, requiring only a single labeled frame per video. Its practical impact lies in enabling robust 3D surgical navigation and AR guidance with minimal annotation burden, while maintaining real-time potential thanks to an efficient design. The core mathematical construct, the ST-CFM weight $W_t = \frac{1}{1 + e^{\epsilon (|\hat{\mathcal{D}}_t^f - \hat{\mathcal{D}}_t^b| - \tau)}}$, underpins reliable pseudo-label filtering across time.
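The ST-CFM weight can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `st_cfm_weight` is hypothetical, the two inputs are assumed to be the forward and backward disparity predictions for the same frame, and the values chosen for $\epsilon$ and $\tau$ are placeholders, since the summary does not state them.

```python
import numpy as np


def st_cfm_weight(disp_fwd, disp_bwd, *, eps, tau):
    """Per-pixel weight W_t = 1 / (1 + exp(eps * (|D_f - D_b| - tau))).

    disp_fwd, disp_bwd: arrays of the same shape (assumed to be the forward
    and backward disparity predictions for one frame).
    eps, tau: steepness and threshold of the soft cutoff (placeholder values
    must be supplied by the caller; the source does not specify them).
    """
    diff = np.abs(np.asarray(disp_fwd, dtype=float) - np.asarray(disp_bwd, dtype=float))
    z = eps * (diff - tau)
    # 1 / (1 + exp(z)) rewritten with tanh so large z does not overflow.
    return 0.5 * (1.0 - np.tanh(0.5 * z))


if __name__ == "__main__":
    # Illustrative disparities only, not data from the paper.
    fwd = np.array([[10.0, 12.0], [8.0, 30.0]])
    bwd = np.array([[10.0, 13.0], [8.5, 20.0]])
    weights = st_cfm_weight(fwd, bwd, eps=5.0, tau=1.0)
    print(weights)  # close to 1 where the two predictions agree, close to 0 where they differ strongly
```

Under this formula, a pixel whose forward and backward predictions differ by exactly $\tau$ receives a weight of 0.5, and larger $\epsilon$ makes the transition between trusted and rejected pixels sharper. Such a weight would typically multiply a per-pixel pseudo-label loss, although the exact loss is not given in this summary.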

Abstract

Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL outperforms other image-based state-of-the-art methods, improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
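The idea of one model switching between IP, FVP, and BVP can be sketched as a change in how a recurrent hidden state is passed between frames. The toy below is only an illustration under stated assumptions: `toy_update` is a hypothetical stand-in for a GRU update (a simple blend, not a real GRU), `run_mode` is a hypothetical helper, and the per-frame "features" are plain arrays rather than stereo cost volumes.

```python
import numpy as np


def toy_update(hidden, feat, mix=0.5):
    """Hypothetical stand-in for a GRU step: blend the old state with the new feature."""
    return (1.0 - mix) * hidden + mix * feat


def run_mode(feats, mode):
    """Process a list of per-frame features under one of three hidden-state flows.

    IP  : the hidden state is reset for every frame (no temporal context).
    FVP : the hidden state is carried from frame t-1 to frame t.
    BVP : the hidden state is carried from frame t+1 to frame t.
    """
    n = len(feats)
    outputs = [None] * n
    if mode == "IP":
        for t in range(n):
            outputs[t] = toy_update(np.zeros_like(feats[t]), feats[t])
    elif mode == "FVP":
        hidden = np.zeros_like(feats[0])
        for t in range(n):
            hidden = toy_update(hidden, feats[t])
            outputs[t] = hidden
    elif mode == "BVP":
        hidden = np.zeros_like(feats[-1])
        for t in reversed(range(n)):
            hidden = toy_update(hidden, feats[t])
            outputs[t] = hidden
    else:
        raise ValueError(f"unknown mode: {mode}")
    return outputs


if __name__ == "__main__":
    # Illustrative features for a three-frame clip.
    clip = [np.array([1.0]), np.array([2.0]), np.array([4.0])]
    for mode in ("IP", "FVP", "BVP"):
        print(mode, [float(o[0]) for o in run_mode(clip, mode)])
```

In this toy, the same update function serves all three modes, and only the order of traversal and the reset policy change. Forward and backward outputs for the same frame can then be compared, which is the kind of bidirectional agreement the V2V stage relies on.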

Paper Structure

This paper contains 26 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The overview of our proposed TiS-TSL. At its core is a time-switchable model, which operates in three distinct modes based on the flow patterns of the hidden states within the GRUs, i.e., Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP). Enabled by this model, TiS-TSL comprises two stages, i.e., Image-to-Video (I2V) stage and Video-to-Video (V2V) stage. The I2V stage aims to pretrain spatial components using labeled images while initializing temporal modeling on unlabeled video via pseudo-supervision. The subsequent V2V stage applies bidirectional prediction consistency to filter out unreliable regions in the pseudo labels, thereby explicitly reinforcing temporal consistency.
  • Figure 2: Qualitative comparisons with image-based methods on SCARED and Hamlyn videos. The second column represents the temporal image profiles obtained by slicing the images along the timeline at green-line positions. The subsequent columns represent the corresponding disparity profiles. The red arrows indicate highlight regions. OpenCV Jet Colormap is used for visualization.
  • Figure 3: Error maps of predicted disparities on SCARED and Hamlyn. Regions indicated by red arrows in the figure represent flat areas, while those indicated by yellow arrows represent boundary areas. The black regions represent invalid pixels in the GT disparity.
  • Figure 4: Qualitative comparisons with two video-based methods on SCARED videos. The second column represents the temporal image profiles obtained by slicing the images along the timeline at green-line positions. The subsequent columns represent the corresponding disparity profiles. The red arrows indicate highlight regions. OpenCV Jet Colormap is used for visualization.