MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction

Mithun Parab; Pranay Lendave; Jiyoung Kim; Thi Quynh Dan Nguyen; Palash Ingle

TL;DR

MT3DNet addresses surgical scene understanding by jointly solving segmentation, instrument detection, and monocular depth estimation, enabling 3D reconstruction from a single camera. It introduces an Adversarial Weight Update to balance multi-task optimization in a Transformer-based framework consisting of an encoder, a decoder, and task-specific heads. On EndoVis2018, MT3DNet achieves competitive segmentation and detection performance and a depth MAE of around 2.2 mm, with ablations showing gains from adversarial weighting over single-task baselines. The approach advances real-time, depth-informed scene understanding for minimally invasive surgery (MIS) and opens avenues for broader clinical deployment and cross-modality extensions.

Abstract

In image-assisted minimally invasive surgeries (MIS), understanding surgical scenes is vital for real-time feedback to surgeons, skill evaluation, and improving outcomes through collaborative human-robot procedures. The challenge lies in accurately detecting, segmenting, and estimating the depth of surgical scenes depicted in high-resolution images, while simultaneously reconstructing the scene in 3D and providing instrument segmentation along with a detection label for each instrument. To address this challenge, a novel Multi-Task Learning (MTL) network is proposed that performs these tasks concurrently. A key aspect of the approach is overcoming the optimization hurdles of handling multiple tasks concurrently by integrating an Adversarial Weight Update into the MTL framework. The proposed MTL model achieves 3D reconstruction by integrating segmentation, depth estimation, and object detection, thereby enhancing the understanding of surgical scenes; this marks a significant advance over existing studies that lack 3D capabilities. Comprehensive experiments on the EndoVis2018 benchmark dataset show that the model addresses all three tasks efficiently, demonstrating the efficacy of the proposed techniques.
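To make the link between monocular depth, segmentation, and 3D reconstruction concrete, the following is a minimal sketch of standard pinhole back-projection, in which a per-pixel depth map is lifted to a 3D point cloud and optionally restricted to pixels inside a segmentation mask. It is not taken from the paper, which does not describe its reconstruction step in this section. The function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy values are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth_mm: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    mask: Optional[List[List[int]]] = None,
) -> List[Point]:
    """Lift pixels (u, v) with depth Z to camera-frame points (X, Y, Z).

    Assumes an ideal pinhole camera (hypothetical intrinsics, no lens
    distortion). Pixels with non-positive depth, or outside the optional
    binary mask, are skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth_mm):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth estimate at this pixel
            if mask is not None and not mask[v][u]:
                continue  # pixel is not part of the segmented region
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 example: depths in millimetres, mask marks pixels to keep.
depth = [[50.0, 50.0], [0.0, 60.0]]
mask = [[1, 0], [1, 1]]
cloud = backproject_depth(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5, mask=mask)
```

In this toy input, one pixel is dropped by the mask and one by its zero depth, so `cloud` holds two points. With real data, the depth map and mask would come from the depth and segmentation heads, and the intrinsics from camera calibration.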
