Editing Physiological Signals in Videos Using Latent Representations
Tianwen Zhou, Akshay Paruchuri, Josef Spjut, Kaan Akşit
TL;DR
This work addresses privacy risks from non-contact physiological sensing in facial videos by introducing PhysioLatent, a latent-space editing framework that controllably modulates heart-rate signals while preserving visual fidelity. The method encodes video frames with a frozen 3D Causal VAE and conditions the latent representation with a CLIP text embedding of the target heart rate, using spatio-temporal self-attention and FiLM-based decoder conditioning to ensure temporally coherent, subtle edits. A composite loss combining $\text{L}_F = \text{MSE} + \text{LPIPS}$ with physiological terms $\text{L}_{wave}$ and $\text{L}_{freq}$ (and a curriculum ramp) guides the model to match the desired $HR_d$ while maintaining visual realism. Empirical results on multiple datasets show high perceptual quality (average PSNR ≈ 38.96 dB, SSIM ≈ 0.98) and accurate HR modulation (MAE ≈ 10 bpm, MAPE ≈ 10%), with a demonstrated HR-removal mode and strong robustness across estimators, enabling privacy-preserving editing and synthetic data generation in video pipelines.
Abstract
Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
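The reported numbers rely on standard error and fidelity metrics. As a minimal sketch of how such figures are typically computed, the snippet below implements MAE and MAPE over per-video heart rates and PSNR over flattened pixel values. The function names (`hr_mae`, `hr_mape`, `psnr`) and the 8-bit pixel range are illustrative assumptions; the evaluation protocol actually used (windowing, rPPG estimator, frame sampling) is not specified in this section.

```python
import math


def _check_lengths(a, b):
    # Both sequences must be non-empty and aligned element by element.
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def hr_mae(target_bpm, estimated_bpm):
    """Mean absolute error (bpm) between target and estimated heart rates."""
    _check_lengths(target_bpm, estimated_bpm)
    total = sum(abs(t - e) for t, e in zip(target_bpm, estimated_bpm))
    return total / len(target_bpm)


def hr_mape(target_bpm, estimated_bpm):
    """Mean absolute percentage error (%), relative to the target heart rate."""
    _check_lengths(target_bpm, estimated_bpm)
    total = sum(abs(t - e) / t for t, e in zip(target_bpm, estimated_bpm))
    return 100.0 * total / len(target_bpm)


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio (dB) over flattened pixel values.

    max_value=255.0 assumes 8-bit pixels (an assumption of this sketch).
    """
    _check_lengths(reference, edited)
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical values: target HRs from the prompt, and HRs that an rPPG
# estimator might recover from the edited videos.
targets = [60, 90, 120]
estimates = [66, 81, 120]
print(hr_mae(targets, estimates))   # error in bpm
print(hr_mape(targets, estimates))  # error in percent
```

MAPE normalizes each error by the target heart rate, so the same absolute deviation counts for more at low target rates than at high ones, which is why the two metrics are usually reported together.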
