PitVis-2023 Challenge: Workflow Recognition in videos of Endoscopic Pituitary Surgery
Adrito Das, Danyal Z. Khan, Dimitrios Psychogyios, Yitong Zhang, John G. Hanrahan, Francisco Vasconcelos, You Pang, Zhen Chen, Jinlin Wu, Xiaoyang Zou, Guoyan Zheng, Abdul Qayyum, Moona Mazher, Imran Razzak, Tianbin Li, Jin Ye, Junjun He, Szymon Płotka, Joanna Kaleta, Amine Yamlahi, Antoine Jund, Patrick Godau, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Dominik Rivoir, Alejandra Pérez, Santiago Rodriguez, Pablo Arbeláez, Danail Stoyanov, Hani J. Marcus, Sophia Bano
TL;DR
PitVis-2023 tackles automated workflow recognition in endoscopic pituitary surgery by defining three tasks—step recognition, instrument recognition, and their joint prediction—and providing a public dataset of 25 training and 8 testing videos with per-second annotations. The study surveys related work, outlines robust evaluation metrics that balance per-frame accuracy and temporal stability, and reports results from 9 teams employing diverse spatio-temporal architectures and multi-task strategies. Findings show that temporal modeling and multi-task approaches substantially improve performance over purely spatial baselines, with top results around 60% Macro-F1 on steps and 40%–45% on instruments, though generalization remains a challenge due to data scarcity and class imbalance. The work offers benchmarked baselines, dataset resources, and insights that push MIS CV toward clinically useful real-time assistance, while highlighting avenues for future improvements in explainability, data breadth, and transfer learning.
Abstract
The field of computer vision applied to videos of minimally invasive surgery is ever-growing. Workflow recognition pertains to the automated recognition of various aspects of a surgery: including which surgical steps are performed; and which surgical instruments are used. This information can later be used to assist clinicians when learning the surgery; during live surgery; and when writing operation notes. The Pituitary Vision (PitVis) 2023 Challenge tasks the community to step and instrument recognition in videos of endoscopic pituitary surgery. This is a unique task when compared to other minimally invasive surgeries due to the smaller working space, which limits and distorts vision; and higher frequency of instrument and step switching, which requires more precise model predictions. Participants were provided with 25-videos, with results presented at the MICCAI-2023 conference as part of the Endoscopic Vision 2023 Challenge in Vancouver, Canada, on 08-Oct-2023. There were 18-submissions from 9-teams across 6-countries, using a variety of deep learning models. A commonality between the top performing models was incorporating spatio-temporal and multi-task methods, with greater than 50% and 10% macro-F1-score improvement over purely spacial single-task models in step and instrument recognition respectively. The PitVis-2023 Challenge therefore demonstrates state-of-the-art computer vision models in minimally invasive surgery are transferable to a new dataset, with surgery specific techniques used to enhance performance, progressing the field further. Benchmark results are provided in the paper, and the dataset is publicly available at: https://doi.org/10.5522/04/26531686.
