YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos

Kedi Sun; Le Zhang

TL;DR

This work introduces a YOLOv10-based multi-task framework for real-time hand localization and hand laterality classification in first-person trauma videos, evaluated on the Trauma THOMPSON Challenge Task 2 dataset. By extending YOLOv10 to predict left and right hand as distinct classes and applying robust data augmentation, the model achieves a practical balance between speed and accuracy, with $mAP_{[0.5:0.95]} \approx 0.33$ and left/right accuracies of $67\%$ and $71\%$, while operating near $38$ FPS for real-time use. Ablation studies highlight the importance of augmentation and regularization and show that explicit laterality labeling adds a modest detection cost but preserves essential clinical information. The results underscore the feasibility of real-time, multi-task hand tracking and laterality assessment in emergency surgical settings, while pointing to future enhancements via temporal modeling and improved background discrimination to further boost robustness.

Abstract

Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, which consists of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. In evaluation, the model reaches classification accuracies of 67% for left hands and 71% for right hands, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.
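
For readers unfamiliar with the $mAP_{[0.5:0.95]}$ notation, the following is a minimal sketch of the matching rule behind it, not the authors' evaluation code. It assumes the usual COCO-style convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The `(x1, y1, x2, y2)` box format, the `"left"`/`"right"` label strings, and the helper names `iou` and `thresholds_passed` are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box, pred_label, gt_label):
    """Count the IoU thresholds at which a prediction matches the ground truth.

    A prediction with the wrong laterality label never counts as a match,
    because left and right hands are treated as distinct classes.
    """
    if pred_label != gt_label:
        return 0
    overlap = iou(pred_box, gt_box)
    return sum(overlap >= t for t in IOU_THRESHOLDS)


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    pred = (1, 0, 11, 10)  # shifted by one pixel along x
    print(iou(pred, gt))
    print(thresholds_passed(pred, gt, "left", "left"))
    print(thresholds_passed(pred, gt, "right", "left"))
```

Full mAP additionally ranks predictions by confidence and integrates the precision-recall curve per class at each threshold. The sketch only shows why a box that is slightly offset, or that carries the wrong laterality label, lowers the averaged score.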

Paper Structure (8 sections, 4 figures)

Introduction
Methods
Dataset
Model Architectures
Training and Testing
Ablation Studies and Comparative Analysis
Results
Discussion and Conclusion

Figures (4)

Figure 1: Curves of various assessment indicators in the final round of training.
Figure 2: The normalized confusion matrix of the final training round.
Figure 3: Model performance on the test set.
Figure 4: Comparison of actual model performance (left) and Ground Truth (right).
