Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah

TL;DR

The paper addresses the privacy risk posed by latent visual features from video foundation models and critiques pixel-level privacy methods for lacking generality and requiring retraining. It introduces SPLAVU, a latent-space privacy framework that attaches a lightweight Anonymizing Adapter Module (AAM) to frozen encoders and optimizes three objectives: a clip-level privacy loss $\mathcal{L}_B$, a multi-task utility loss $\mathcal{L}_{T^*}$, and a latent-consistency loss $\mathcal{L}_{LC}$ to preserve generalization across seen and unseen tasks. Empirical results show SPLAVU reduces privacy leakage by about $35\%$ while maintaining near-baseline performance on Action Recognition, Temporal Action Detection, and Anomaly Detection, and it promotes fairness by mitigating gender biases through new evaluation protocols. The approach scales to large video foundation models with low training cost and demonstrates data efficiency, benefiting practical deployments in privacy-preserving video understanding and debiasing contexts.

Abstract

We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
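As a rough illustration of how the three objectives might be combined, one common pattern is a weighted sum minimized only over the adapter parameters, with the video encoder kept frozen. The weights $\omega_B$ and $\omega_T$ are the privacy/utility weights mentioned in the Figure 3 caption; the consistency weight $\omega_{LC}$ and the plain additive form are assumptions made here for illustration and are not confirmed by the text above.

$$
\mathcal{L}_{\text{total}} \;=\; \omega_B \, \mathcal{L}_B \;+\; \omega_T \, \mathcal{L}_{T^*} \;+\; \omega_{LC} \, \mathcal{L}_{LC}
$$

Under this reading, raising $\omega_B$ relative to $\omega_T$ would push the adapter toward stronger anonymization at some cost in utility, which matches the trade-off the paper reports varying.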

Paper Structure

This paper contains 27 sections, 13 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Our proposed latent anonymization setup (red) utilizes large pretrained video encoders, applying a lightweight anonymizer that maintains performance on multiple video understanding tasks while strongly reducing performance on private attribute prediction tasks (right).
  • Figure 2: Workflow illustrating the SPLAVU training process. The process begins with a video clip $\mathbf{x}^{(i)}_{t}$, from which two random frames are sampled to create static clips. All clips are passed through the frozen video encoder $f_E$ to extract latent features, then further processed by our Anonymizing Adapter Module (AAM) $f_A$. The temporal clip features are used for the latent consistency loss and given to the set of task-specific classifier heads $f_{T^*}$. The two static clip features ($\mathbf{\bar{h}}^{(i)}_{\bar{t1}}$, $\mathbf{\bar{h}}^{(i)}_{\bar{t2}}$) are utilized in the self-supervised mutual information minimization objective. Gradients from all losses are back-propagated through AAM. A complete training algorithm is provided in Appendix Sec. \ref{sec:algorithm}.
  • Figure 3: Privacy-utility trade-off on PA-HMDB51. Privacy measured by attacker cMAP ($\downarrow$), utility by AR acc. ($\uparrow$). Different points show varied privacy/utility weights $\omega_B, \omega_T$. SPLAVU achieves a favorable trade-off.
  • Figure 4: Ablation on tasks seen during anonymization training. The checkmark (✓) labels seen tasks, x-mark (✗) and highlighted cells indicate tasks unseen during training. Performance generalizes to unseen tasks, while directly training further improves results.
  • Figure 5: Graph showing the overall runtime and accuracy of 3 privacy-preserving methods. The x-axis shows time in seconds and the y-axis shows an overall accuracy/privacy score computed as in Eq. \ref{eq:combined_performance}.
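
The static-clip construction described in the Figure 2 caption (two random frames sampled from a clip, each turned into a clip with no motion) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `make_static_clips`, the choice to sample two distinct frames, and the choice to repeat each frame to the original clip length are all assumptions made for illustration.

```python
import random


def make_static_clips(clip, rng=None):
    """Build two static clips from one temporal clip.

    `clip` is a sequence of frames (any objects). Each static clip repeats
    a single randomly chosen frame, so it carries appearance but no motion.
    Assumption: the two frames are distinct and each static clip has the
    same length as the input clip.
    """
    rng = rng or random.Random()
    if len(clip) < 2:
        raise ValueError("need at least two frames to sample two static clips")
    # Pick two distinct frame indices uniformly at random.
    t1, t2 = rng.sample(range(len(clip)), 2)
    static_1 = [clip[t1]] * len(clip)
    static_2 = [clip[t2]] * len(clip)
    return static_1, static_2


# Hypothetical usage with placeholder frame labels instead of image arrays.
frames = ["f0", "f1", "f2", "f3"]
clip_a, clip_b = make_static_clips(frames, random.Random(0))
```

In the workflow the caption describes, both static clips and the original temporal clip would then pass through the frozen encoder $f_E$ and the adapter $f_A$, with the static-clip features feeding the privacy objective.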