Twins-PainViT: Towards a Modality-Agnostic Vision Transformer Framework for Multimodal Automatic Pain Assessment using Facial Videos and fNIRS

Stefanos Gkikas, Manolis Tsiknakis

TL;DR

The paper tackles multimodal automatic pain assessment by introducing Twins-PainViT, a modality-agnostic framework that treats facial videos and fNIRS as unified inputs through waveform representations and a dual Vision Transformer architecture. PainViT-1 extracts embeddings from each modality, which are visualized as 224×224 waveform diagrams and passed to PainViT-2 for final pain prediction. The authors pre-train the model in a multi-task setting across diverse emotion and biosignal datasets, and employ extensive augmentations and regularization to improve generalization. Empirical results on the AI4PAIN dataset show unimodal and multimodal benefits, with the Single Diagram fusion achieving the highest accuracy of 46.76%, outperforming the baseline by 6.56 percentage points. The work highlights interpretable attention patterns across modalities and underscores the potential of modality-agnostic transformers for real-world pain monitoring, while noting challenges for clinical deployment and the need for further validation.

Abstract

Automatic pain assessment plays a critical role in advancing healthcare and optimizing pain management strategies. This study has been submitted to the First Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed multimodal framework utilizes facial videos and fNIRS and presents a modality-agnostic approach, alleviating the need for domain-specific models. It employs a dual ViT configuration and adopts waveform representations both for the fNIRS signals and for the embeddings extracted from the two modalities. The results demonstrate the efficacy of the proposed method, which achieves an accuracy of 46.76% in the multilevel pain assessment task.
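
The waveform representation mentioned above can be illustrated with a minimal sketch. It rasterizes a 1D vector (an fNIRS channel or an embedding) as a 224×224 binary image, the diagram size given in the TL;DR. The function name `render_waveform`, the min-max normalization, the linear resampling, and the binary pixel values are all assumptions made for illustration; this page does not describe how the authors actually draw their diagrams.

```python
from typing import List, Sequence

SIZE = 224  # diagram side length, taken from the TL;DR


def render_waveform(values: Sequence[float], size: int = SIZE) -> List[List[int]]:
    """Rasterize a 1D signal as a size x size binary image (1 = curve pixel).

    Hypothetical helper: normalization and drawing style are assumptions.
    """
    if len(values) < 2:
        raise ValueError("need at least two samples to draw a waveform")

    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal

    # Resample to one amplitude per image column with linear interpolation.
    rows = []
    for col in range(size):
        pos = col * (len(values) - 1) / (size - 1)
        left = int(pos)
        right = min(left + 1, len(values) - 1)
        frac = pos - left
        amp = values[left] * (1.0 - frac) + values[right] * frac
        # Map amplitude to a row index; row 0 is the top of the image.
        row = round((1.0 - (amp - lo) / span) * (size - 1))
        rows.append(min(size - 1, max(0, row)))  # clamp against rounding drift

    image = [[0] * size for _ in range(size)]
    prev = rows[0]
    for col, row in enumerate(rows):
        # Fill the vertical gap to the previous column so the curve is connected.
        for r in range(min(prev, row), max(prev, row) + 1):
            image[r][col] = 1
        prev = row
    return image
```

Because the output is an ordinary image grid regardless of where the vector came from, the same image model can consume either modality, which is the sense in which the pipeline is modality-agnostic.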

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 14 tables.

Figures (4)

  • Figure 1: PainViT: (a) Hierarchical organization of the PainViT blocks, each with different depths, illustrating the reduction in token resolution at each stage; (b) Detail of the Token-Mixer module, showcasing its components including a depthwise convolution (DWConv) and batch normalization; (c) The Feed-Forward Network (FFN) structure within the Token-Mixer; (d) The Cascaded Attention mechanism across multiple heads, depicting the process of adding outputs from previous heads to enhance the self-attention computation, and the final output projection; (e) Overview of the proposed multimodal pipeline, utilizing videos and fNIRS. The extracted embeddings from PainViT-1 are visualized as waveform diagrams, which are then combined into a single diagram depicting both modalities before being fed into PainViT-2 for the final pain assessment.
  • Figure 2: Waveform diagrams representing different data modalities: (a) original fNIRS signal waveform, (b) video embedding extracted from PainViT-1, and (c) fNIRS embedding extracted from PainViT-1.
  • Figure 3: Attention maps from PainViT-2.
  • Figure 4: Additional attention maps from PainViT-2.
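
The cascaded attention described in the Figure 1(d) caption, where each head's output is added to the next head's input, can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes each head operates on its own slice of the feature dimension, uses identity query/key/value projections, and omits the final output projection. All function names are hypothetical.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x: Matrix) -> Matrix:
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(x[0])
    transposed = [list(col) for col in zip(*x)]
    scores = [[s / math.sqrt(dim) for s in row] for row in matmul(x, transposed)]
    return matmul(softmax_rows(scores), x)


def cascaded_attention(tokens: Matrix, num_heads: int) -> Matrix:
    """Run heads in sequence; each head sees its slice plus the previous head's output."""
    dim = len(tokens[0])
    if dim % num_heads:
        raise ValueError("feature dimension must be divisible by num_heads")
    chunk = dim // num_heads

    outputs: List[Matrix] = []
    prev: Optional[Matrix] = None
    for h in range(num_heads):
        part = [row[h * chunk:(h + 1) * chunk] for row in tokens]
        if prev is not None:
            # The cascade: add the previous head's output to this head's input.
            part = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(part, prev)]
        prev = self_attention(part)
        outputs.append(prev)

    # Concatenate head outputs per token (output projection omitted here).
    return [[v for head in outputs for v in head[i]] for i in range(len(tokens))]
```

With a single token, attention returns the token unchanged, so `cascaded_attention([[1.0, 2.0, 3.0, 4.0]], 2)` makes the cascade visible: the second head receives `[3, 4] + [1, 2]`.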