QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View
Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak
TL;DR
This work tackles automation in life-saving interventions from a first-person perspective by addressing action recognition, action anticipation, and visual question answering (VQA) in the Trauma THOMPSON (T3) challenge. For the two action tasks, it combines an Action Dictionary-guided (ADG) learning scheme with momentum-contrast distillation (MoMA) and image-based preprocessing to transfer knowledge from large-scale sources to medical procedure tasks, ranking 2nd in both. For VQA, the pipeline uses VinVL object features and deep modular co-attention networks (MCAN) augmented with a frame-question cross-attention (FQCA) mechanism, which gives the best performance among its experiments and ranks 1st in the challenge. The work indicates that first-person representations and cross-modal attention can improve automated understanding and guidance in life-saving scenarios, with practical implications for remote instruction and support in austere environments.
Abstract
In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image, and we then incorporate momentum- and attention-based knowledge distillation to improve performance on both tasks. For training, we present an action dictionary-guided design, which consistently yields the best results across our experiments. For VQA, we leverage object-level features and deploy co-attention networks to jointly learn from object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions rank $2^{nd}$ in the action recognition and anticipation tasks and $1^{st}$ in the VQA task.
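The sample-and-stitch pre-processing step can be illustrated with a minimal sketch. The abstract does not state how frames are chosen or arranged, so the uniform sampling, the row-major grid layout, and the names `sample_indices` and `stitch_grid` are all assumptions made for illustration, not the authors' implementation. Frames are represented as plain 2-D lists of pixel values to keep the example free of dependencies.

```python
def sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: uniform sampling at the centre of equal-length segments.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def stitch_grid(frames, cols):
    """Tile equally sized frames into one image, row-major.

    Assumption: a simple grid layout with `cols` frames per grid row.
    Each frame is a 2-D list (height x width) of pixel values.
    """
    if cols <= 0 or len(frames) % cols != 0:
        raise ValueError("number of frames must be a multiple of cols")
    stitched = []
    for start in range(0, len(frames), cols):
        row_frames = frames[start:start + cols]
        height = len(row_frames[0])
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # place frames side by side
            stitched.append(line)
    return stitched


# Toy clip: 8 frames of size 2x2, where frame t is filled with the value t.
clip = [[[t] * 2 for _ in range(2)] for t in range(8)]
picked = [clip[i] for i in sample_indices(len(clip), 4)]
image = stitch_grid(picked, cols=2)  # one 4x4 image built from 4 frames
```

Under these assumptions, a single stitched image carries several time steps at once, so an image-based model can be applied to a video clip without a dedicated temporal architecture.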
