Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
TL;DR
The paper addresses the privacy risk posed by latent visual features from video foundation models and critiques pixel-level privacy methods for lacking generality and requiring retraining. It introduces SPLAVU, a latent-space privacy framework that attaches a lightweight Anonymizing Adapter Module (AAM) to frozen encoders and optimizes three objectives: a clip-level privacy loss $\mathcal{L}_B$, a multi-task utility loss $\mathcal{L}_{T^*}$ to retain performance on seen tasks, and a latent-consistency loss $\mathcal{L}_{LC}$ to preserve generalization to unseen tasks. Empirical results show SPLAVU reduces privacy leakage by about $35\%$ while maintaining near-baseline performance on Action Recognition, Temporal Action Detection, and Anomaly Detection. The paper also proposes new protocols for assessing gender bias in action recognition and reports that the method mitigates such bias. The approach scales to large video foundation models with low training cost and demonstrates data efficiency, benefiting practical deployments in privacy-preserving video understanding and debiasing contexts.
Abstract
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundation models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis of anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
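The three objectives described above can be illustrated with a minimal NumPy sketch of how an adapter on frozen features might combine a privacy term, a utility term, and a latent-consistency term. Everything concrete here is an assumption made for illustration and is not taken from the paper: the residual MLP form of the adapter, cosine similarity between two clips of the same video as a stand-in for the privacy objective, mean squared error as the consistency term, a single linear classification head as the utility task, and the weights `lam_b` and `lam_lc`.

```python
import numpy as np

def adapter(feats, w1, w2):
    """Hypothetical residual MLP adapter applied to frozen encoder features."""
    hidden = np.maximum(feats @ w1, 0.0)  # ReLU
    return feats + hidden @ w2

def latent_consistency(anon, frozen):
    """Assumed form: mean squared distance between anonymized and frozen features."""
    return float(np.mean((anon - frozen) ** 2))

def clip_similarity(anon_a, anon_b):
    """Assumed privacy proxy: mean cosine similarity between two clips of the
    same video. Lowering it reduces the static information the clips share."""
    a = anon_a / np.linalg.norm(anon_a, axis=1, keepdims=True)
    b = anon_b / np.linalg.norm(anon_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def cross_entropy(logits, labels):
    """Utility term for one seen task: softmax cross-entropy with integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def total_loss(clip_a, clip_b, labels, w1, w2, head, lam_b=1.0, lam_lc=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    anon_a = adapter(clip_a, w1, w2)
    anon_b = adapter(clip_b, w1, w2)
    utility = cross_entropy(anon_a @ head, labels)
    privacy = clip_similarity(anon_a, anon_b)
    consistency = latent_consistency(anon_a, clip_a)
    return utility + lam_b * privacy + lam_lc * consistency

# Toy data standing in for frozen features of two clips per video.
rng = np.random.default_rng(0)
batch, dim, hidden_dim, classes = 4, 8, 16, 3
clip_a = rng.normal(size=(batch, dim))
clip_b = clip_a + 0.1 * rng.normal(size=(batch, dim))
labels = rng.integers(0, classes, size=batch)
w1 = rng.normal(scale=0.1, size=(dim, hidden_dim))
w2 = np.zeros((hidden_dim, dim))  # zero init: the adapter starts as the identity
head = rng.normal(scale=0.1, size=(dim, classes))

loss = total_loss(clip_a, clip_b, labels, w1, w2, head)
print("total loss:", loss)
```

In this sketch only `w1`, `w2`, and the task head would be trained, while the features themselves stay fixed, which mirrors the plug-and-play use of AAM on frozen encoders described in the abstract. Initializing `w2` to zeros makes the adapter start as an identity map, so the consistency term begins at zero; this is a design choice of the sketch, not a detail reported by the authors.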
