Editing Physiological Signals in Videos Using Latent Representations

Tianwen Zhou, Akshay Paruchuri, Josef Spjut, Kaan Akşit

TL;DR

This work addresses privacy risks from non-contact physiological sensing in facial videos by introducing PhysioLatent, a latent-space editing framework that controllably modulates heart-rate signals while preserving visual fidelity. The method encodes video frames with a frozen 3D Causal VAE and conditions the latent representation with a CLIP text embedding of the target heart rate, using spatio-temporal self-attention and FiLM-based decoder conditioning to ensure temporally coherent, subtle edits. A composite loss combining $\text{L}_F = \text{MSE} + \text{LPIPS}$ with physiological terms $\text{L}_{\text{wave}}$ and $\text{L}_{\text{freq}}$ (and a curriculum ramp) guides the model to match the desired $HR_d$ while maintaining visual realism. Empirical results on multiple datasets show high perceptual quality (average PSNR ≈ 38.96 dB, SSIM ≈ 0.98) and accurate HR modulation (MAE ≈ 10 bpm, MAPE ≈ 10%), with a demonstrated HR-removal mode and strong robustness across estimators, enabling privacy-preserving editing and synthetic data generation in video pipelines.
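
The reported metrics (PSNR for visual quality, MAE and MAPE for HR modulation error) follow standard definitions, which can be sketched as below. This is a minimal illustration and not the authors' evaluation code: the function names `psnr`, `hr_mae`, and `hr_mape` are hypothetical, and the 8-bit peak value of 255 is an assumption.

```python
import numpy as np


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frame arrays.

    max_value=255.0 assumes 8-bit pixel intensities (an assumption, not from the paper).
    """
    reference = np.asarray(reference, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    mse = np.mean((reference - edited) ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs have unbounded PSNR
    return float(10.0 * np.log10(max_value ** 2 / mse))


def hr_mae(target_bpm, estimated_bpm):
    """Mean absolute error (bpm) between target and estimated heart rates."""
    target = np.asarray(target_bpm, dtype=np.float64)
    estimated = np.asarray(estimated_bpm, dtype=np.float64)
    return float(np.mean(np.abs(estimated - target)))


def hr_mape(target_bpm, estimated_bpm):
    """Mean absolute percentage error (%) relative to the target heart rates."""
    target = np.asarray(target_bpm, dtype=np.float64)
    estimated = np.asarray(estimated_bpm, dtype=np.float64)
    return float(100.0 * np.mean(np.abs(estimated - target) / target))
```

Here the target heart rate plays the role of ground truth, so an edited video whose estimated HR is 70 bpm against a 60 bpm target contributes 10 bpm to the MAE.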

Abstract

Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
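
To make the FiLM conditioning mentioned above concrete, the following is a minimal NumPy sketch of feature-wise linear modulation: a condition vector is mapped to a per-channel scale and shift that are applied to a feature map. The tensor layout `(C, T, H, W)`, the linear projections, and all names (`film`, `w_gamma`, `w_beta`, and so on) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation to a video feature map.

    features: array of shape (C, T, H, W) (layout assumed for illustration)
    cond:     condition embedding of shape (D,), e.g. an embedded HR prompt
    w_gamma, w_beta: hypothetical projection matrices of shape (C, D)
    b_gamma, b_beta: hypothetical bias vectors of shape (C,)
    """
    gamma = w_gamma @ cond + b_gamma  # per-channel scale, shape (C,)
    beta = w_beta @ cond + b_beta     # per-channel shift, shape (C,)
    # Broadcast the per-channel parameters over time and space.
    return gamma[:, None, None, None] * features + beta[:, None, None, None]


# Usage with random placeholder values (not trained weights).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 16, 16))
prompt_embedding = rng.standard_normal(32)
out = film(
    feats,
    prompt_embedding,
    rng.standard_normal((8, 32)),
    np.ones(8),
    rng.standard_normal((8, 32)),
    np.zeros(8),
)
print(out.shape)
```

Because the modulation is a per-channel affine transform, it changes feature statistics without altering spatial or temporal resolution, which fits the goal of subtle, globally consistent edits.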

Paper Structure

This paper contains 25 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Our learned framework modifies physiological signals in videos by editing them using fused latent representations, maintaining good visual fidelity in the output videos. The input video and heart rate text prompt are encoded and fused to produce the latent representation, which is compatible with foundational generative processes. Source images are from MMSE-HR [7780743].
  • Figure 2: The two key characteristics of physiological signals in video: temporal coherence and visually imperceptible change. Source images are from MMSE-HR [7780743].
  • Figure 3: Overview of the proposed framework. An input facial video and a prompt are encoded into latent representations $z$ and $c$ using a 3D Causal VAE encoder [yang2024cogvideox] and a CLIP text embedder, respectively. The fused latent features are processed by spatio-temporal layers with AdaLN to inject temporal coherence and subtle variations. The 3D Causal VAE decoder reconstructs the output video with FiLM conditioning, supervised by visual fidelity and physiological losses to ensure perceptual quality and accurate modulation. To further enhance visual fidelity, we incorporate a face detection module that generates a face mask $M$, replacing only the facial region of the input video with the decoder output. Source images are from PURE [stricker2014video].
  • Figure 4: Comparison of proposed spatio-temporal layer settings. (a) A naïve baseline that fuses features using only stacked pseudo (2+1)D convolutions, which capture local patterns but fail to model long-range dependencies. (b) Our improved design, which addresses the temporal correlation of signals by introducing decomposable space-time self-attention, with AdaLN injecting the desired signal into the temporal stream for fine-grained modulation. $z$ and $z'$ denote the input and output latent vectors, and $c$ corresponds to the embedded text prompt.
  • Figure 5: Qualitative comparison results of our proposed framework. We visualize representative novel frames from multiple datasets before and after modification, together with the estimated rPPG signals under the POS estimator and the corresponding Power Spectral Density (PSD) curves. We target HR values commonly encountered in real-life scenarios. In addition, we zoom in on selected regions of the frames to better reveal fine-grained details. The zoomed-in results show that, due to the encoding–decoding pipeline of the backbone, local distortions appear in high-frequency areas such as edges and textures. Source images are from MMPD [10340857].
  • ...and 3 more figures
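
Two operations described in the captions above lend themselves to a short sketch: compositing the decoder output into the input video through the face mask $M$ (Figure 3), and reading a heart rate off the peak of a power spectrum (Figure 5). The code below is an illustration under stated assumptions, not the authors' pipeline: the function names, the `(T, H, W, 3)` video layout, and the 40 to 180 bpm search band are hypothetical, and the spectrum is a plain periodogram applied to an already extracted 1D signal rather than the POS estimator.

```python
import numpy as np


def composite_face(input_video, decoded_video, mask):
    """Keep decoder output inside the face mask and original pixels elsewhere.

    input_video, decoded_video: arrays of shape (T, H, W, 3) (assumed layout)
    mask: array of shape (H, W) or (T, H, W) with values in [0, 1]
    """
    m = np.asarray(mask, dtype=np.float64)[..., None]  # add a channel axis
    return m * decoded_video + (1.0 - m) * input_video


def hr_from_spectrum(signal, fps, low_bpm=40.0, high_bpm=180.0):
    """Estimate heart rate (bpm) as the strongest spectral peak in a band.

    signal: 1D pulse-like trace sampled at `fps` Hz
    The band limits are illustrative defaults, not values from the paper.
    """
    x = np.asarray(signal, dtype=np.float64)
    x = x - x.mean()  # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs_bpm = np.fft.rfftfreq(x.size, d=1.0 / fps) * 60.0
    band = (freqs_bpm >= low_bpm) & (freqs_bpm <= high_bpm)
    return float(freqs_bpm[band][np.argmax(power[band])])


# Usage: a synthetic 1.5 Hz trace sampled at 30 fps for 10 seconds.
t = np.arange(300) / 30.0
trace = np.sin(2.0 * np.pi * 1.5 * t)
print(hr_from_spectrum(trace, fps=30.0))
```

Restricting the edit to the masked region means background pixels pass through unchanged, so any reconstruction distortion from the encoder-decoder backbone is confined to the face.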