GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Hao Wang, Euijoon Ahn, Jinman Kim

TL;DR

This work investigates how to repurpose general video transformers for remote physiological measurement (RPM) tasks, notably rPPG-based heart-rate estimation, without relying on RPM-specific modules. By formulating practical guidelines for data pre-processing and network configuration, the authors adapt MViTv2 and demonstrate that DiffNorm, appropriate temporal hierarchies, and relative positional encoding enable robust performance across five public RPM datasets in intra- and cross-dataset settings. The results show that GVT2RPM variants can exceed state-of-the-art RPM methods, and that the approach generalizes across different GVT architectures (MViTv2, UniFormer, Video Swin). The work highlights a path to leverage advances in general video understanding for RPM, reducing task-specific customization while maintaining strong performance, though it acknowledges limitations related to skin tone effects, model scale, and manual configuration. Overall, GVT2RPM provides a flexible, architecture-agnostic framework for translating general video transformers to RPM with broad cross-dataset robustness and practical applicability in remote healthcare monitoring.
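The summary names DiffNorm as a pre-processing step but does not define it. The sketch below assumes it means the normalized frame difference common in the rPPG literature: each consecutive frame pair is turned into (next - prev) / (next + prev), and the result is rescaled by its standard deviation. The function name `diff_normalize`, the `eps` guard, and the flat-list frame layout are illustrative assumptions, not taken from the paper.

```python
from statistics import pstdev


def diff_normalize(frames, eps=1e-7):
    """Hypothetical DiffNorm sketch (assumed formulation, not the paper's code).

    frames: list of T frames, each a flat list of pixel intensities (floats).
    Returns T-1 difference frames, scaled to unit standard deviation.
    """
    diffs = []
    for prev, nxt in zip(frames, frames[1:]):
        # Normalized difference between consecutive frames; eps avoids 0/0.
        diffs.append([(b - a) / (b + a + eps) for a, b in zip(prev, nxt)])

    # Rescale by the standard deviation over all difference values.
    flat = [v for frame in diffs for v in frame]
    scale = pstdev(flat) if flat else 0.0
    if scale == 0.0:
        return diffs  # static clip: nothing to rescale
    return [[v / scale for v in frame] for frame in diffs]


if __name__ == "__main__":
    clip = [[1.0, 1.0], [3.0, 1.0], [2.0, 1.0]]
    print(diff_normalize(clip))
```

Dividing by the sum of the two frames makes the difference less sensitive to overall brightness, which is why this kind of input is often preferred over raw frames when the signal of interest is a small temporal variation in skin color.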

Abstract

Remote physiological measurement (RPM) is an essential tool for healthcare monitoring, as it enables the measurement of physiological signs, e.g., heart rate, in a remote setting via physical wearables. Recently, video-based RPM using facial videos has advanced rapidly. However, adopting facial videos for RPM in clinical settings largely depends on accuracy and robustness, i.e., whether the method works across patient populations. Fortunately, state-of-the-art transformer architectures have brought marked improvements in general (natural) video understanding, and these have been translated to facial understanding, including RPM. However, existing RPM methods usually need RPM-specific modules, e.g., temporal difference convolution and handcrafted feature maps. Although these customized modules can increase accuracy, their robustness across datasets has not been demonstrated. Further, because they customize the transformer architecture, they cannot use the advancements made in general video transformers (GVTs). In this study, we interrogate the GVT architecture and empirically analyze how training designs, i.e., data pre-processing and network configurations, affect model performance when applied to RPM. Based on the structure of video transformers, we propose to configure their spatiotemporal hierarchy to align with the dense temporal information needed in RPM for signal feature extraction. We define several practical guidelines and gradually adapt GVTs for RPM without introducing RPM-specific modules. Our experiments demonstrate results that compare favorably with existing counterparts built on RPM-specific modules. We conducted extensive experiments on five datasets in intra-dataset and cross-dataset settings. We highlight that the proposed guidelines, GVT2RPM, can be generalized to any video transformer and are robust across datasets.

Paper Structure (29 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of our proposed guidelines for adapting GVTs to remote physiological measurement. Blue highlights the designs that could affect final performance. The parameters of each guideline are listed in brackets and were selected based on empirical results.
  • Figure 2: Experiment results under the MMPD-simple intra-dataset setting, exploring the adaptation from MViTv2 to GVT2RPM-MViT.
  • Figure 3: Design of upsampling module.
  • Figure 4: Intra-dataset experiment results on MMPD-simple, MMPD, RLAP, and UBFC-rPPG. We evaluated five RPM SOTA methods. Their averaged results are denoted by SOTA-avg with error bars, and the best-performing method is denoted by SOTA-best. We also tested three GVTs: MViTv2, UniFormer, and Video Swin. Based on our empirical results in Section \ref{sec:empirical}, we constructed GVT2RPM-MViT-general, GVT2RPM-UniFormer-general, and GVT2RPM-Swin-general. Following Section \ref{sec:guideline}, we further optimized them into GVT2RPM-MViT-optimal, GVT2RPM-UniFormer-optimal, and GVT2RPM-Swin-optimal. Performance is measured by mean absolute error (MAE).
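
The Figure 4 results are reported as MAE. As a reminder of what that metric computes, here is a minimal sketch, assuming it is applied to per-clip heart-rate estimates in beats per minute. The function name, argument names, and sample values are illustrative, not taken from the paper.

```python
def mean_absolute_error(predicted_bpm, reference_bpm):
    """Average absolute gap between predicted and reference heart rates."""
    if len(predicted_bpm) != len(reference_bpm):
        raise ValueError("inputs must have the same length")
    if not predicted_bpm:
        raise ValueError("inputs must not be empty")
    total = sum(abs(p - r) for p, r in zip(predicted_bpm, reference_bpm))
    return total / len(predicted_bpm)


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(mean_absolute_error([70.0, 80.0], [72.0, 77.0]))
```

Lower is better: an MAE of 0 would mean every predicted heart rate matches its reference exactly.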