Multitask Learning in Minimally Invasive Surgical Vision: A Review

Oluwatosin Alabi; Tom Vercauteren; Miaojing Shi

TL;DR

This review analyzes how multitask learning (MTL) has been applied to minimally invasive surgical (MIS) vision, focusing on videos and images from MIS to jointly solve perceptual, tracking, workflow, anticipation, skill assessment, and report-generation tasks. It surveys common deep MTL methodologies (parameter sharing, optimization and task balancing, auxiliary objectives, and data-efficient strategies) and maps them to MIS applications, highlighting the dominant use of hard parameter sharing and linear loss scalarization while noting opportunities to adopt advanced CV MTL techniques. The paper also catalogs public MIS datasets supporting multitask learning, reviews large-model approaches (VQA and promptable segmentation), and discusses challenges around real-time deployment, data unification, and ethics. Key findings include the widespread success of MTL for perceptual tasks and workflow analysis, the emergence of action-triplet and multi-granularity recognition, and the potential of large models to tackle multiple MIS tasks, tempered by data and deployment constraints. Overall, the authors provide a foundational reference that identifies current trends, gaps, and directions for future MIS MTL research, including standardized benchmarks and ethical considerations to enable robust real-time clinical adoption.

Abstract

Minimally invasive surgery (MIS) has revolutionized many procedures and led to reduced recovery time and risk of patient injury. However, MIS poses additional complexity and burden on surgical teams. Data-driven surgical vision algorithms are thought to be key building blocks in the development of future MIS systems with improved autonomy. Recent advancements in machine learning and computer vision have led to successful applications in analyzing videos obtained from MIS with the promise of alleviating challenges in MIS videos. Surgical scene and action understanding encompasses multiple related tasks that, when solved individually, can be memory-intensive, inefficient, and fail to capture task relationships. Multitask learning (MTL), a learning paradigm that leverages information from multiple related tasks to improve performance and aid generalization, is well suited for fine-grained and high-level understanding of MIS data. This review provides a narrative overview of the current state-of-the-art MTL systems that leverage videos obtained from MIS. Beyond listing published approaches, we discuss the benefits and limitations of these MTL systems. Moreover, this manuscript presents an analysis of the literature for various application fields of MTL in MIS, including those with large models, highlighting notable trends, new directions of research, and developments.

Paper Structure (63 sections, 9 equations, 16 figures, 11 tables)

Introduction
Review methodology and related work
Scope of the review
Search criteria
Selection criteria
Analysis of selected papers
Related reviews
Common deep MTL methodologies in computer vision
MTL concepts
Parameter sharing and feature representation
Optimization and task balancing
Auxiliary objectives
Data efficient approaches
MTL and other learning paradigms
Applications of MTL in surgical scene understanding
...and 48 more sections

Figures (16)

Figure 1: Overview of the application areas where multitask learning has been applied in surgical scene understanding.
Figure 2: Top: hard parameter sharing in deep neural networks for multitask learning, featuring a shared encoder/backbone with a common representation and separate decoders or heads. Bottom: soft parameter sharing with separate models per task and a specialized feature-sharing mechanism.
Figure 3: An overview of optimization techniques for multitask learning as discussed in the section on optimization and task balancing. The classical method, linear scalarization, involves manually weighting the loss functions of all tasks. The first row illustrates linear scalarization against automatic loss weighting methods that dynamically adjust weights during training, such as updating weights to ensure gradient magnitude consistency (GradNorm) and defining a loss weighting equation based on the predicted uncertainty of each task (uncertainty weighting). The second row illustrates gradient-based approaches in comparison to linear scalarization. Gradient-based methods directly modify gradients to mitigate negative transfer, achieved by projecting conflicting gradients to the normal plane of the gradient of another task (PCGrad) or ensuring gradients are at a target angle to each other (GradVac). The third row illustrates linear scalarization and multi-objective optimization techniques. MTMO (Sener and Koltun) ensures that solutions are on the Pareto front, while PTML (Lin et al., 2019) enables the selection of Pareto front solutions, favouring specific tasks.
Figure 4: Diagram illustrating the key differences between multitask, transfer, multiclass, multiloss, multilabel, multimodal, multistage, and multiview learning.
Figure 5: Illustration of the Attention Pruned Multitask Learning (AP-MTL) Network and the optimization method used for training this network (Islam et al., 2020). The top image shows an encoder-decoder network with skip connections for its segmentation and detection decoders. A summary of the Asynchronous Task Optimization (ATO) for obtaining convergence for both tasks in the AP-MTL network is provided at the bottom.
...and 11 more figures
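
Two of the optimization ideas named in the Figure 3 caption can be illustrated with a minimal sketch. It is not taken from the paper: the function names `linear_scalarization` and `project_conflicting` are hypothetical, gradients are plain Python lists rather than network tensors, and only the pairwise projection step of a PCGrad-style method is shown (the published method also iterates over tasks in random order, which is omitted here).

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_scalarization(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-task losses into one scalar using manually chosen weights."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per task loss")
    return sum(w * l for w, l in zip(weights, losses))


def project_conflicting(g_i: Vector, g_j: Vector) -> Vector:
    """Pairwise projection step in the spirit of PCGrad (illustrative only).

    If g_i conflicts with g_j (negative inner product), remove the component
    of g_i along g_j so the result lies in the normal plane of g_j.
    Otherwise g_i is returned unchanged.
    """
    inner = dot(g_i, g_j)
    norm_sq = dot(g_j, g_j)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(g_i)
    scale = inner / norm_sq
    return [a - scale * b for a, b in zip(g_i, g_j)]


if __name__ == "__main__":
    # Hypothetical losses for two tasks, e.g. segmentation and detection.
    print(linear_scalarization([2.0, 4.0], [0.5, 0.25]))
    # Two toy task gradients that point in conflicting directions.
    print(project_conflicting([1.0, 0.0], [-1.0, 1.0]))
```

With fixed weights, linear scalarization leaves conflicts between task gradients untouched. The projection step instead changes the direction of a conflicting gradient before the tasks are combined, which is how the caption describes gradient-based methods mitigating negative transfer.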
