Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge

Adam Schmidt, Mert Asim Karaoglu, Soham Sinha, Mingang Jang, Ho-Gun Ha, Kyungmin Jung, Kyeongmo Gu, Ihsan Ullah, Hyunki Lee, Jonáš Šerých, Michal Neoral, Jiří Matas, Rulin Zhou, Wenlong He, An Wang, Hongliang Ren, Bruno Silva, Sandro Queirós, Estêvão Lima, João L. Vilaça, Shunsuke Kikuchi, Atsushi Kouno, Hiroki Matsuzaki, Tongtong Li, Yulu Chen, Ling Li, Xiang Ma, Xiaojian Li, Mona Sheikh Zeinoddin, Xu Wang, Zafer Tandogdu, Greg Shaw, Evangelos Mazomenos, Danail Stoyanov, Yuxin Chen, Zijian Wu, Alexander Ladikos, Simon DiMaio, Septimiu E. Salcudean, Omid Mohareri

TL;DR

The STIR Challenge 2024 addresses accurate and efficient point tracking in surgical scenes by evaluating algorithms on the STIRC2024 dataset, whose sequences carry ground truth from infrared tattoos. It uses a two-part evaluation, 2D/3D tracking accuracy via the $\delta^{avg}$ metric and inference latency reported up to the 99th percentile, to encourage robust, deployable solutions. Baselines (MFT, CSRT, RAFT, RAFT Stereo, and a Control) are compared against challenge-day teams and post-challenge entrants, revealing that long-term, occlusion-aware strategies (e.g., MedTrack, TAP-Endo) can outperform purely frame-based approaches in certain scenarios, while 3D tracking benefits from stereo depth integration. The dataset, evaluation protocol, and open-source baselines and code provide a resource for advancing surgical image guidance, with implications for segmentation, reconstruction, landmarking, and autonomous assistance in the operating room.
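
As a rough illustration of the accuracy metric, the sketch below computes a $\delta^{avg}$-style score: the fraction of points whose end-point error falls within a threshold, averaged over several thresholds. The threshold values, the function name `delta_avg`, and the one-to-one pairing of predicted and ground-truth points are assumptions made for this example; this summary does not state them, and the official implementation is the one in the STIRMetrics repository.

```python
import math

# Assumed thresholds in pixels (hypothetical; not stated in this summary).
THRESHOLDS_PX = (4, 8, 16, 32, 64)


def delta_avg(predicted, ground_truth, thresholds=THRESHOLDS_PX):
    """Average, over thresholds, of the fraction of points within each threshold.

    Points are tuples of equal dimension, so the same code covers 2D (pixels)
    and 3D (for example millimetres) as long as the thresholds use matching units.
    Assumes predicted[i] corresponds to ground_truth[i].
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need equally sized, non-empty point lists")
    # Euclidean end-point error for each tracked point.
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    # Fraction of points with error at or below each threshold.
    fractions = [sum(e <= t for e in errors) / len(errors) for t in thresholds]
    return sum(fractions) / len(thresholds)


# Toy example with end-point errors of 2, 10, and 100 pixels.
pred = [(10.0, 10.0), (50.0, 40.0), (100.0, 100.0)]
gt = [(12.0, 10.0), (50.0, 50.0), (200.0, 100.0)]
score = delta_avg(pred, gt)
```

A higher score is better, and a tracker that places every point within the smallest threshold scores 1.0.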

Abstract

Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to train algorithms and quantify their performance. This paper introduces a point tracking challenge to address this need, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158, and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics
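
The efficiency component can be pictured with a small timing harness. The sketch below is only an assumption about how per-frame latency might be collected and summarized; the helper names `measure_latencies` and `percentile`, the nearest-rank percentile definition, and the use of milliseconds are all choices made for this example, not details taken from the challenge.

```python
import math
import time


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(step, frames):
    """Time one call of `step` per frame and return the latencies in milliseconds."""
    latencies_ms = []
    for frame in frames:
        start = time.perf_counter()
        step(frame)  # hypothetical per-frame tracker update
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Usage with a stand-in tracker step that does a little arithmetic per frame.
latencies = measure_latencies(lambda frame: sum(frame), [[1, 2, 3]] * 200)
summary = {q: percentile(latencies, q) for q in (50, 90, 99)}
```

Reporting a high percentile alongside the median exposes occasional slow frames that an average would hide, which matters when a tracker has to keep up with a live video stream.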

Paper Structure

This paper contains 33 sections, 2 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: This figure describes the STIR Challenge 2024. Participants submit their algorithms in a docker container. The algorithm receives a video and a list of start points from each sequence in the dataset. Participants use their tracker to estimate the motion of a set of points for every frame in a video. Videos are provided in a streaming manner. The final estimates are then compared to the ground truth labels (Section \ref{sec:dataandannotation}) at the end of the video. The errors (Section \ref{sec:metrics}) are then averaged across all points to obtain the final metrics in 2D or 3D. Latency is also calculated alongside the inference for those who participated in the efficiency component of the challenge.
  • Figure 2: Temporal distribution of videos. Most clips lie between 0 and 10 seconds, with a few longer clips $>20$ seconds. Average clip length is 8.9 seconds.
  • Figure 3: Number of labelled points per video. Labels can be seen in Fig. \ref{fig:startsegs}.
  • Figure 4: Start point labels for all 60 sequences in the STIR 2024 test dataset. For each sequence, center points are extracted from each segmentation and passed to each participant's tracker.
  • Figure 5: Dataset labels and label creation process. Ground truth is created by tattooing points onto the tissue with a tattoo needle; the points are then imaged in infrared (IR) at the start and end of each video. After tattooing is completed, multiple sequences can be collected. For each sequence, the camera captures an image in IR (ground truth start frame), then switches to white light. Actions are performed under white light, and this video is recorded and saved. Then the camera switches back to IR and captures the end frame, which is used as the ground truth for each point's motion. Segments are the binary-thresholded IR images; tattooed regions are shown in white. On the right is a set of randomly selected triplets, one per point shown, each consisting of (IR image, visible light image, segment/GT image).
  • ...and 6 more figures
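
The Figure 4 and Figure 5 captions say that segments are binary-thresholded IR images and that center points are extracted from each segmentation. A minimal sketch of one way to do this follows, assuming 4-connected regions and plain pixel centroids; the function name `segment_centers` and both of those choices are hypothetical, since the captions do not say how the centers are computed.

```python
from collections import deque


def segment_centers(mask):
    """Return (x, y) centroids of the 4-connected foreground regions of a binary mask.

    `mask` is a list of rows; truthy entries mark tattooed (white) pixels.
    Regions are reported in row-major order of their first pixel.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    centers = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill that collects the pixels of one region.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            xs, ys = [], []
            while queue:
                x, y = queue.popleft()
                xs.append(x)
                ys.append(y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers


# Toy mask with two tattooed regions.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
centers = segment_centers(mask)
```

Each returned center would then serve as one start point handed to a tracker, with the centers from the end-frame segmentation serving as the targets for the error computation.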