YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos

Kedi Sun, Le Zhang

TL;DR

This work introduces a YOLOv10-based multi-task framework for real-time hand localization and hand laterality classification in first-person trauma videos, evaluated on the Trauma THOMPSON Challenge Task 2 dataset. By extending YOLOv10 to predict left and right hand as distinct classes and applying robust data augmentation, the model achieves a practical balance between speed and accuracy, with $mAP_{[0.5:0.95]} \approx 0.33$ and left/right accuracies of $67\%$ and $71\%$, while operating near $38$ FPS for real-time use. Ablation studies highlight the importance of augmentation and regularization and show that explicit laterality labeling adds a modest detection cost but preserves essential clinical information. The results underscore the feasibility of real-time, multi-task hand tracking and laterality assessment in emergency surgical settings, while pointing to future enhancements via temporal modeling and improved background discrimination to further boost robustness.

Abstract

Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, which consists of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. Evaluation yields classification accuracies of 67\% for left hands and 71\% for right hands, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.
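To make the $mAP_{[0.5:0.95]}$ figure easier to read, the sketch below shows the box-overlap test that underlies it. It assumes the common COCO-style convention, in which average precision is averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05; the abstract does not spell out the convention, so treat this as an illustration rather than the authors' evaluation code. The names `iou`, `IOU_THRESHOLDS`, and `thresholds_matched` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_matched(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box counts as a match for gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    ground_truth = (0, 0, 10, 8)
    print(iou(predicted, ground_truth))
    print(thresholds_matched(predicted, ground_truth))
```

Under this convention, a detection that overlaps its ground-truth box only loosely counts as correct at the lower thresholds and as a miss at the stricter ones, so a score averaged over all ten thresholds is typically lower than the score at IoU 0.5 alone.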

Paper Structure (8 sections, 4 figures)


Figures (4)

  • Figure 1: Curves of various assessment indicators in the final round of training.
  • Figure 2: The normalized confusion matrix of the final training round.
  • Figure 3: Model performance on the test set.
  • Figure 4: Comparison of actual model performance (left) and Ground Truth (right).