From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition

Nazia Aslam, Abhisek Ray, Joakim Bruslund Haurum, Lukas Esterle, Kamal Nasrollahi

Abstract

Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-$k$ tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.

Figures (2)

  • Figure 1: Overview of our proposed video anonymization framework. A ViT-based anonymizer $f_A$ employs two independent CLS tokens, act_CLS and priv_CLS, to attend over the full tubelet sequence. Each tubelet $i$ is assigned a utility-privacy score $s_i=\alpha_i^{\text{act}} -\lambda_{\text{priv}} \cdot \alpha_i^{\text{priv}}$, where $\alpha_i^{\text{act}}$ and $\alpha_i^{\text{priv}}$ denote the CLS-to-tubelet attention weights. Tubelets are then ranked by $s_i$, and the top-$k$ tubelets are retained, suppressing privacy-sensitive content while preserving action-relevant cues; the remaining low-score tubelets are compressed via a token fusion module (a code sketch of this step follows the figure list). The anonymized video is then fed to a utility branch $f_T$ and a budget model $f_B$ under action and privacy objectives.
  • Figure 2: Visualization of our framework on three actions: (a) weightlifting, (b) YoYo, and (c) playing violin. For each ten-frame clip, the top row shows raw consecutive frames, and the bottom row shows the corresponding anonymized frames produced by our method.
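
The scoring and pruning step described in Figure 1 can be summarized in a few lines of PyTorch. The sketch below is illustrative rather than the paper's implementation: the function name prune_tubelets, the tensor shapes, and the mean-pooling stand-in for the token fusion module are our assumptions; only the score $s_i=\alpha_i^{\text{act}}-\lambda_{\text{priv}} \cdot \alpha_i^{\text{priv}}$ and the top-$k$ retention come from the caption.

```python
import torch


def prune_tubelets(tokens, attn_act, attn_priv, k, lambda_priv=1.0):
    """Score tubelets by utility vs. privacy attention and keep the top-k.

    tokens:    (B, N, D) tubelet embeddings from the shared ViT backbone
    attn_act:  (B, N) attention weights from act_CLS to each tubelet
    attn_priv: (B, N) attention weights from priv_CLS to each tubelet
    """
    # Utility-privacy score from Figure 1: s_i = alpha_i^act - lambda_priv * alpha_i^priv
    scores = attn_act - lambda_priv * attn_priv  # (B, N)

    # Retain the k highest-scoring tubelets (action-relevant, privacy-light).
    top_idx = scores.topk(k, dim=1).indices  # (B, k)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    kept = tokens.gather(1, gather_idx)  # (B, k, D)

    # The paper compresses the remaining low-score tubelets with a token
    # fusion module; mean pooling is used as a stand-in here (assumption).
    pruned_mask = torch.ones_like(scores, dtype=torch.bool).scatter(1, top_idx, False)
    pruned = tokens[pruned_mask].view(tokens.size(0), -1, tokens.size(-1))
    fused = pruned.mean(dim=1, keepdim=True)  # (B, 1, D)

    return torch.cat([kept, fused], dim=1)  # (B, k + 1, D)


# Example: 196 tubelets of dimension 768, keep the 128 highest-scoring ones.
B, N, D, k = 2, 196, 768, 128
tokens = torch.randn(B, N, D)
attn_act = torch.softmax(torch.randn(B, N), dim=1)
attn_priv = torch.softmax(torch.randn(B, N), dim=1)
out = prune_tubelets(tokens, attn_act, attn_priv, k)
print(out.shape)  # torch.Size([2, 129, 768])
```

Note that fusing the discarded tubelets into a single token, rather than dropping them outright, keeps a compact summary of the removed regions available to the downstream utility branch while denying the budget model fine-grained privacy cues.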