Self-Supervised Video Desmoking for Laparoscopic Surgery

Renlong Wu, Zhilu Zhang, Shuohao Zhang, Longfei Gou, Haobin Chen, Lei Zhang, Hao Chen, Wangmeng Zuo

TL;DR

This work addresses the removal of surgical smoke from real laparoscopic videos when no paired clean data are available. It introduces SelfSVD, a self-supervised video desmoking framework that uses the pre-smoke (PS) frame $\mathbf{S}_{ps}$ both as supervision and as a reference input. A deformation-based loss with optical-flow alignment, a masking strategy, and a regularization term prevent trivial solutions and allow stable training on real smoky videos. The authors collect LSVD, a dataset of real laparoscopic videos, and show that SelfSVD and its lightweight variant outperform state-of-the-art methods in smoke removal and detail recovery, with potential for real-time deployment. Because it trains on real videos with real pre-smoke frames rather than synthesized smoke, the approach reduces the domain gap and gives surgeons a clearer view.

Abstract

Because real paired data are difficult to collect, most existing desmoking methods train their models on synthesized smoke and generalize poorly to real surgical scenarios. A few works have explored single-image real-world desmoking in an unpaired learning manner, but they still struggle with dense smoke. In this work, we address both issues by introducing self-supervised surgery video desmoking (SelfSVD). On the one hand, we observe that the frame captured before the activation of high-energy devices is generally clear (named the pre-smoke frame, or PS frame), so it can serve as supervision for the other smoky frames, making real-world self-supervised video desmoking practically feasible. On the other hand, to enhance desmoking performance, we further feed the valuable information from the PS frame into the model, and present a masking strategy and a regularization term to avoid trivial solutions. In addition, we construct a real surgery video dataset for desmoking that covers a variety of smoky scenes. Extensive experiments on the dataset show that SelfSVD removes smoke more effectively and efficiently while recovering more photo-realistic details than state-of-the-art methods. The dataset, code, and pre-trained models are available at https://github.com/ZcsrenlongZ/SelfSVD.
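
The idea of supervising a smoky frame with a flow-aligned PS frame can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `warp_nearest` and `aligned_l1_loss`, the nearest-neighbour warping, the `(dx, dy)` flow layout, and the plain L1 error over in-bounds pixels are all assumptions made here for clarity. The optical flow is taken as given rather than estimated.

```python
import numpy as np


def warp_nearest(image, flow):
    """Backward-warp `image` (H, W, C) with `flow` (H, W, 2) holding (dx, dy) offsets.

    Output pixel (y, x) is read from image[y + dy, x + dx], rounded to the nearest
    pixel. Also returns a boolean (H, W) mask marking pixels whose source location
    lies inside the image. (Hypothetical helper, for illustration only.)
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Clip so the lookup never fails; out-of-bounds pixels are flagged by `valid`.
    warped = image[np.clip(src_y, 0, h - 1), np.clip(src_x, 0, w - 1)]
    return warped, valid


def aligned_l1_loss(pred, ps_frame, flow):
    """Mean absolute error between `pred` and the flow-aligned PS frame.

    Only pixels with an in-bounds source are counted. (Assumed loss form.)
    """
    target, valid = warp_nearest(ps_frame, flow)
    if not valid.any():
        return 0.0
    return float(np.abs(pred - target)[valid].mean())


# Toy usage: a 4x4 single-channel "PS frame" and a one-pixel shift along x.
ps = np.arange(16, dtype=float).reshape(4, 4, 1)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
perfect_pred, _ = warp_nearest(ps, flow)
loss = aligned_l1_loss(perfect_pred, ps, flow)
```

In this toy case the prediction equals the aligned PS frame, so the loss is zero; any residual smoke in a real prediction would raise it. Excluding out-of-bounds pixels is one simple way to keep alignment failures from contributing to the error.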

Paper Structure (28 sections, 13 equations, 18 figures, 7 tables)

Figures (18)

  • Figure 1: The comparison of surgical smoke in (a) and atmospheric haze in (b). Degraded images are in the top row and their clean reference images are in the bottom row.
  • Figure 2: Illustration of processing the $i$-th smoky frame $\mathbf{S}_{i}$. The PS frame ($\mathbf{S}_{ps}$) is used both as supervision and as the reference (Ref) input. A masking strategy with the masked-ref generator (see Figure 4) and a regularization term (defined in the paper's regularization loss equation) are introduced to prevent trivial solutions. $\mathbf{H}_{i-1}$ denotes the temporal features from previous frames, and $\mathbf{H}_{i}$ denotes the temporal features passed on to subsequent frames.
  • Figure 3: Examples of trivial solutions. When the PS frame is naively fed in as Ref, the imperfect optical flow between Ref and the smoky frame leads to trivial solutions, as indicated by the yellow arrows. The same positions are marked with yellow lines.
  • Figure 4: The structure of the masked-ref generator. The mask generator produces a mask $\mathbf{M}_i$, which is used to obtain the masked reference features $\tilde{\mathbf{F}}_{ref \rightarrow i}$.
  • Figure 5: Illustration of enhancing the PS frame ($\mathbf{S}_{ps}$) as supervision. We regard $\mathbf{S}_{ps}$ as a frame with less smoke and feed it into a pre-trained SelfSVD model, generating a cleaner result $\mathbf{S}_{ps}^{\ast}$. Then, $\mathbf{S}_{ps}^{\ast}$ is taken as improved supervision to fine-tune the SelfSVD model with $\mathcal{L}_{rec}$ and $\mathcal{L}_{GAN}$ (i.e., replacing $\mathbf{S}_{ps}$ with $\mathbf{S}_{ps}^{\ast}$ in the definitions of these two losses), yielding an improved model named SelfSVD$^{\ast}$.
  • ...and 13 more figures
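
The masking step described in the Figure 4 caption can be illustrated with a short sketch. It assumes, without confirmation from the captions, that the mask $\mathbf{M}_i$ takes values in $[0, 1]$ and is applied to the aligned reference features by element-wise multiplication broadcast over channels. The function name `mask_reference_features` and the `(C, H, W)` layout are hypothetical, and the mask is taken as given rather than produced by a learned mask generator.

```python
import numpy as np


def mask_reference_features(ref_feat, mask):
    """Suppress aligned reference features wherever the mask is low.

    ref_feat: (C, H, W) features of Ref aligned to frame i.
    mask:     (H, W) values in [0, 1], shared by all channels.
    Returns the masked features with the same shape as `ref_feat`.
    (Assumed element-wise form, for illustration only.)
    """
    if mask.shape != ref_feat.shape[1:]:
        raise ValueError("mask must match the spatial size of ref_feat")
    return ref_feat * mask[None, :, :]


# Toy usage: keep the reference features only at the centre pixel.
feat = np.ones((2, 3, 3))
mask = np.zeros((3, 3))
mask[1, 1] = 1.0
masked = mask_reference_features(feat, mask)
```

Zeroing the reference features in some regions forces the model to rely on the smoky input there, which is one plausible way a mask could discourage simply copying Ref to the output.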