CodePhys: Robust Video-based Remote Physiological Measurement through Latent Codebook Querying

Shuyang Chu, Menghan Xia, Mengyao Yuan, Xin Liu, Tapio Seppanen, Guoying Zhao, Jingang Shi

TL;DR

CodePhys targets noise-robust remote photoplethysmography (rPPG) from facial videos by learning a noise-free latent PPG codebook and recasting rPPG measurement as a code-query task. Stage I builds a discrete codebook of ground-truth PPG (GT-PPG) features via a signal autoencoder. Stage II uses a spatial-aware encoder with an auxiliary prior branch (APB) and soft feature distillation to query the codebook and reconstruct high-fidelity rPPG signals. The approach reports state-of-the-art intra- and cross-dataset performance on VIPL-HR, UBFC-rPPG, PURE, and COHFACE, and is robust to common real-world video degradations. CodePhys also works as a plug-and-play framework that improves the robustness of existing end-to-end rPPG models, with efficient inference and design choices supported by ablation studies. Overall, the method advances robust, non-contact heart-rate estimation by combining discrete priors and query-based correction in a two-stage, spatially informed architecture, with potential for online adaptation and edge deployment.

Abstract

Remote photoplethysmography (rPPG) aims to measure physiological signals from facial videos without contact, and has shown great potential in many applications. Most existing methods design neural networks that extract rPPG features directly from video for heart rate estimation. Although they can achieve acceptable results, recovering the rPPG signal becomes difficult when real-world interference affects the facial video. Specifically, facial videos are inevitably affected by non-physiological factors (e.g., camera device noise, defocus, and motion blur), which distort the extracted rPPG signals. Recent rPPG extraction methods are easily affected by such interference and degradation, resulting in noisy rPPG signals. In this paper, we propose a novel method named CodePhys, which treats rPPG measurement as a code query task in a noise-free proxy space (i.e., a codebook) constructed from ground-truth PPG signals. We consider noisy rPPG features as queries and generate high-fidelity rPPG features by matching them with noise-free PPG features from the codebook. Our approach also incorporates a spatial-aware encoder network with a spatial attention mechanism to highlight physiologically active areas, and uses a distillation loss to reduce the influence of non-periodic visual interference. Experimental results on four benchmark datasets demonstrate that CodePhys outperforms state-of-the-art methods in both intra-dataset and cross-dataset settings.
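
The code-query idea can be illustrated with a minimal sketch. It assumes the matching is a nearest-neighbour lookup under squared Euclidean distance, as in common vector-quantization codebooks; the abstract does not specify the exact matching rule, so this is an assumption. The function names, the toy 2-D codebook, and the feature values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_codebook(
    queries: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Replace each noisy query with its nearest codebook entry.

    Assumption: nearest-neighbour matching; the paper's matching
    rule may differ.
    """
    indices: List[int] = []
    matched: List[List[float]] = []
    for q in queries:
        # Index of the codebook entry closest to this query.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(q, codebook[k]),
        )
        indices.append(idx)
        matched.append(list(codebook[idx]))
    return indices, matched


# Hypothetical toy data: three "noise-free" codes and three noisy queries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
noisy_features = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

indices, clean_features = query_codebook(noisy_features, codebook)
print(indices)         # [1, 0, 2]
print(clean_features)  # [[1.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
```

In this sketch, whatever noise a query carries is discarded once the query is snapped to a code, which is the intuition behind using a codebook built only from ground-truth PPG features.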

Paper Structure

This paper contains 33 sections, 16 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Conceptual comparison of CodePhys and existing typical rPPG measurement methods. (a) Directly extracting rPPG signals from the video. (b) Designing hand-crafted modules to eliminate specific visual interference, taking the typical method LiMotionRobust2023 as an example. (c) Learning a codebook to eliminate all types of visual interference. Here the cloud symbolizes the visual interference present in the video or in features, and the yellow module signifies the capability to remove such interference.
  • Figure 2: Overview of CodePhys. In Stage I, a codebook consisting of noise-free PPG features is learned by reconstructing GT-PPG signals, which is treated as the prior. In Stage II, the spatial-aware encoder (composed of a video feature extractor and a spatio-temporal encoder with an auxiliary prior branch) queries corresponding PPG features with respect to the input video. Subsequently, the decoder network reconstructs the estimated rPPG signal from the queried features. The embedding layer in Stage II is the same as that in Stage I. The DConv, TDC, Conv1D, and LN represent deformable convolution DaiDeformCNN2017, temporal difference convolution YuAuto2020, 1D convolution, and layer normalization, respectively.
  • Figure 3: Architecture of the video feature extractor $E_v$ with a spatial attention mechanism (SAM) attached. The 'S' and 'C' denote the sigmoid function and concatenation operation, respectively.
  • Figure 4: The Bland-Altman plot (a) and scatter plot (b) show the difference between ground-truth HRs and HRs predicted by CodePhys on Fold-1 of the VIPL-HR dataset.
  • Figure 5: Visual comparison of the rPPG signals (top) and HRs (bottom) predicted by CodePhys on the UBFC-rPPG dataset, alongside the corresponding ground truth.
  • ...and 5 more figures
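
For readers unfamiliar with the Bland-Altman plot mentioned in Figure 4, the quantities it displays can be sketched as follows. This uses the standard definition (mean difference as bias, with 95% limits of agreement at bias ± 1.96 standard deviations of the differences). The heart-rate values are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def bland_altman(
    gt_hr: Sequence[float], pred_hr: Sequence[float]
) -> Tuple[List[float], List[float], float, float, float]:
    """Return per-sample means and differences, the bias, and the
    lower and upper 95% limits of agreement."""
    diffs = [p - g for g, p in zip(gt_hr, pred_hr)]          # y-axis of the plot
    means = [(p + g) / 2.0 for g, p in zip(gt_hr, pred_hr)]  # x-axis of the plot
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return means, diffs, bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical heart rates in beats per minute (not from the paper).
gt = [60.0, 72.0, 80.0, 95.0]
pred = [62.0, 71.0, 80.0, 97.0]

means, diffs, bias, lower, upper = bland_altman(gt, pred)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
# bias=0.75, limits=(-2.19, 3.69)
```

A bias near zero with narrow limits of agreement indicates that predicted and ground-truth heart rates agree closely across the measured range.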