QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View

Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak

TL;DR

This work tackles automation in life-saving interventions from a first-person perspective through three tasks in the Trauma THOMPSON (T3) challenge: action recognition, action anticipation, and visual question answering (VQA). For the two action tasks, it combines an Action Dictionary-guided (ADG) learning scheme with momentum-contrast distillation (MoMA) and image-based preprocessing to transfer knowledge from large-scale sources to medical procedure tasks, ranking 2nd in the challenge. For VQA, the pipeline uses VinVL object features and deep modular co-attention networks (MCAN) augmented with a frame-question cross-attention (FQCA) mechanism, which gave the best results among the reported experiments and ranked 1st in the challenge. The results suggest that first-person representations and cross-modal attention can improve automated understanding and guidance in life-saving scenarios, with practical implications for remote instruction and support in austere environments.

Abstract

In this paper, we present our solutions for a range of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge: action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples multiple inputs and stitches them into a single image, and we incorporate momentum- and attention-based knowledge distillation to improve performance on both tasks. For training, we present an action dictionary-guided design, which consistently yields the best results across our experiments. For VQA, we leverage object-level features and deploy co-attention networks to learn jointly from object and question features. We also introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve $2^{nd}$ rank in the action recognition and anticipation tasks and $1^{st}$ rank in the VQA task.
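The sample-and-stitch pre-processing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how many inputs are sampled, how they are chosen, or how they are laid out, so the uniform sampling rule, the row-major grid, and the function names `sample_indices` and `stitch_frames` below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sample_indices(num_frames: int, k: int) -> list[int]:
    """Pick k frame indices spread evenly over a clip.

    Assumption: uniform sampling at the centre of k equal segments.
    If the clip has fewer than k frames, some indices repeat.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / k)) for i in range(k)]


def stitch_frames(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile rows*cols frames of shape (H, W, C) into one (rows*H, cols*W, C) image.

    Assumption: frames are placed in row-major order (left to right, top to bottom).
    """
    n, h, w, c = frames.shape
    if n != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    grid = frames.reshape(rows, cols, h, w, c)
    # Reorder to (rows, H, cols, W, C) so each tile keeps its pixels contiguous.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


if __name__ == "__main__":
    # Hypothetical clip: 16 RGB frames of 224x224 pixels filled with zeros.
    clip = np.zeros((16, 224, 224, 3), dtype=np.uint8)
    picked = clip[sample_indices(len(clip), 4)]
    image = stitch_frames(picked, rows=2, cols=2)
    print(image.shape)
```

Packing several frames into one image lets a standard image backbone see temporal context without a video-specific architecture; the grid size trades temporal coverage against per-frame resolution.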

Paper Structure

This paper contains 22 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of ADG: Action Dictionary-guided learning model for the action recognition and action anticipation tasks.
  • Figure 2: Illustration of our model for the VQA task.