Twins-PainViT: Towards a Modality-Agnostic Vision Transformer Framework for Multimodal Automatic Pain Assessment using Facial Videos and fNIRS
Stefanos Gkikas, Manolis Tsiknakis
TL;DR
The paper tackles multimodal automatic pain assessment by introducing Twins-PainViT, a modality-agnostic framework that treats facial videos and fNIRS as unified inputs through waveform representations and a dual Vision Transformer architecture. PainViT-1 extracts embeddings from each modality, which are visualized as 224×224 waveform diagrams and passed to PainViT-2 for final pain prediction. The authors pre-train the model in a multi-task setting across diverse emotion and biosignal datasets, and employ extensive augmentations and regularization to improve generalization. Empirical results on the AI4PAIN dataset show unimodal and multimodal benefits, with the Single Diagram fusion achieving the highest accuracy of 46.76%, outperforming the baseline by 6.56 percentage points. The work highlights interpretable attention patterns across modalities and underscores the potential of modality-agnostic transformers for real-world pain monitoring, while noting challenges for clinical deployment and the need for further validation.
Abstract
Automatic pain assessment plays a critical role in advancing healthcare and optimizing pain management strategies. This study was submitted to the First Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed multimodal framework uses facial videos and functional near-infrared spectroscopy (fNIRS) and takes a modality-agnostic approach, alleviating the need for domain-specific models. It employs a dual Vision Transformer (ViT) configuration and adopts waveform representations both for the fNIRS signals and for the embeddings extracted from the two modalities. The method achieves an accuracy of 46.76% in the multilevel pain assessment task, demonstrating its efficacy.
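To make the idea of a waveform representation concrete, the following is a minimal sketch of how a 1-D signal or embedding vector could be rasterized into a square image suitable as ViT input. It is not the authors' implementation: the function name `waveform_diagram`, the binary pixel encoding, the min-max scaling, and the linear interpolation are all assumptions made for illustration. Only the 224×224 output size comes from the text.

```python
import math


def waveform_diagram(values, size=224):
    """Rasterize a 1-D sequence into a size x size binary image (list of rows).

    Hypothetical helper for illustration only. Pixel value 1 marks the curve,
    0 the background. Row 0 is the top of the image.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to draw a waveform")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    image = [[0] * size for _ in range(size)]

    def row_at(col):
        # Linearly interpolate the signal at this column (assumed resampling).
        pos = col * (len(values) - 1) / (size - 1)
        i = min(int(pos), len(values) - 2)
        frac = pos - i
        v = values[i] * (1 - frac) + values[i + 1] * frac
        # Min-max scale to a row index; larger values sit higher in the image.
        row = round((1 - (v - lo) / span) * (size - 1))
        return min(max(row, 0), size - 1)

    prev = row_at(0)
    for col in range(size):
        cur = row_at(col)
        # Fill the vertical run between neighbouring columns so the curve
        # stays connected even where the signal changes quickly.
        for r in range(min(prev, cur), max(prev, cur) + 1):
            image[r][col] = 1
        prev = cur
    return image


# Example: a synthetic 500-dimensional "embedding" drawn as a 224x224 diagram.
embedding = [math.sin(0.1 * k) for k in range(500)]
diagram = waveform_diagram(embedding)
print(len(diagram), len(diagram[0]))
```

Under these assumptions, vectors of any length and any modality map to an image of the same shape, which is what allows a single image model to consume them without modality-specific preprocessing.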
