A vision-language model and platform for temporally mapping surgery from video

Dani Kiyasseh

Abstract

Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries, grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.

Paper Structure (30 sections, 1 equation, 9 figures, 3 tables)

Results
Discussion
Methods
Author contributions
Competing Interests

Figures (9)

Figure 1: Halsted maps surgery from video. Halsted is trained on the Halsted Surgical Atlas, a library with 650K+ videos, to generate a comprehensive mapping of surgery with 104 categories of surgical components. We present an example of such a mapping for the components of procedure, steps, anatomy, arm, and instrument.
Figure 2: Halsted learns to comprehensively map surgery across specialties. Halsted reliably (a) maps surgical components at various levels of granularity, from anatomy to procedure type, (b) assesses surgical proficiency irrespective of specialty, and (c) recognizes granular surgical components such as arms used, actions performed, and instruments used. The shaded area and error bars reflect one standard error from the mean.
Figure 3: Halsted learns a nuanced relationship between surgical videos. We present the two-dimensional UMAP embeddings of representations of video clips extracted by VideoMAE (left) and of the features generated by Halsted (right) in the final layer of the transformer decoder when tasked with assessing suturing proficiency. Each colour reflects a distinct procedure and marker size indicates proficiency (large markers indicate low proficiency). Although VideoMAE can distinguish between procedures, Halsted takes this one step further and clusters procedures of the same specialty, as observed with cardiac and gynecology procedures.
Figure 4: Halsted's performance in mapping micro-activity as a function of decoder size. We train Halsted with a 2-layer decoder or a Llama-3.2 (1B) decoder to perform the micro-activity task, jointly mapping surgical actions and instruments. We report performance using 5-fold Monte-Carlo cross-validation and show that these models perform on par with one another, irrespective of the size of the decoder.
Figure 5: Halsted's performance improves with a self-learning strategy. We train Halsted on HSA v1, the first version of the Halsted Surgical Atlas without any AI-generated annotations, and HSA v2, the final version of the Halsted Surgical Atlas after incorporating AI-generated annotations. For details on the iterative labelling process, see Methods. Both models are evaluated on the same held-out test sets using 5-fold Monte-Carlo cross-validation.
...and 4 more figures
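
The captions for Figures 2, 4 and 5 refer to 5-fold Monte-Carlo cross-validation and to error bars of one standard error from the mean. The following is a minimal sketch of that general evaluation pattern, not the authors' code: the function names, the 20% test fraction, the fixed seed, and the example scores are all assumptions made for illustration.

```python
import random
import statistics


def monte_carlo_splits(n_items, n_repeats=5, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits.

    Unlike k-fold cross-validation, every repeat draws a fresh random
    split, so test sets from different repeats may overlap.
    The test_fraction and seed values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_items * test_fraction))
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def mean_and_standard_error(scores):
    """Return the mean of the per-repeat scores and its standard error."""
    mean = statistics.fmean(scores)
    # Standard error = sample standard deviation / sqrt(number of repeats).
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


if __name__ == "__main__":
    # Hypothetical per-repeat scores; in practice each would come from
    # training and evaluating a model on one of the splits above.
    example_scores = [0.81, 0.84, 0.79, 0.83, 0.82]
    mean, sem = mean_and_standard_error(example_scores)
    print(f"score = {mean:.3f} +/- {sem:.3f} (one standard error)")
```

Each repeat would train on the first index list and evaluate on the second; the per-repeat scores are then summarised with `mean_and_standard_error`, which is the quantity a shaded area or error bar of "one standard error from the mean" would display.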