Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition

Filip Ilic, He Zhao, Thomas Pock, Richard P. Wildes

TL;DR

This work highlights the limitations of current paradigms and proposes a solution: human-selected privacy templates that yield interpretability by design, and an obfuscation scheme that selectively hides attributes and also induces temporal consistency, which is important in action recognition.

Abstract

Concerns for the privacy of individuals captured in public imagery have led to privacy-preserving action recognition. Existing approaches often suffer from two issues: obfuscation is applied globally, and the results lack interpretability. Global obfuscation hides privacy-sensitive regions, but also contextual regions important for action recognition. Lack of interpretability erodes trust in these new technologies. We highlight the limitations of current paradigms and propose a solution: human-selected privacy templates that yield interpretability by design, and an obfuscation scheme that selectively hides attributes and also induces temporal consistency, which is important in action recognition. Our approach is architecture agnostic and directly modifies input imagery, while existing approaches generally require architecture training. Our approach offers more flexibility, as no retraining is required, and outperforms alternatives on three widely used datasets.

Paper Structure (11 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Our goal is to hide privacy attributes without action recognition performance dropping. Left: Arbitrary images can be used to specify an interpretable template library defined by privacy attributes. Middle: A salience map is generated from privacy templates; example illustrates use of templates for personal identification. Right: The source video is masked with noise as guided by salience and animated by source video optical flow. Salience makes masking selective to privacy sensitive regions, while preserving scene context; optical flow preserves motion -- both of which are critical for action recognition. The obfuscated video can be input directly to arbitrary privacy and action recognition systems without retraining. Zoomed circles highlight details only for illustration.
  • Figure 2: Overview of Method. (Overview) We present a privacy module that builds on three components: (i) a semantic template library that contains attributes to be hidden, (ii) a descriptor matcher to localize template features in videos to be obscured, and (iii) an obfuscation method that is sensitive to motion present in the scene. (Matching) A semantic descriptor matcher based on DINO-ViT [caron2021emerging, dosovitskiy2020image] keys is used to determine privacy-salient regions in a video based on the template library. In our case, regions of interest correspond to those that can identify a person; however, this component can be adapted for other privacy attributes through specification of different templates. The result is a saliency map. (Selective obfuscation) The saliency map is used as a weight to apply noise to the regions. The noise, however, is not static: an initial noise pattern image, $N(\mathbf{p}, 1)$, is warped with optical flow to preserve the motion information in the source video. The similarity maps of all relevant privacy attributes are aggregated and used to weight the noise and apply it to the input image, obfuscating privacy-sensitive information while not destroying the underlying temporal signal.
  • Figure 3: Template Library consisting of Patches Chosen from Anatomical Landmark Regions. These specific images are passed through a DINO-ViT feature extractor. The keys, corresponding to spatial locations of the highlighted patches, are chosen as the templates for matching to input images to obtain semantically similar regions for obfuscation.
  • Figure 4: Saliency Maps for Descriptors in the Template Library. The manual selection of these templates allows for interpretability of the obfuscated parts of the image by design. The matched DINO-ViT features capture rich semantic information and allow for detailed spatial localization due to the nature of vision transformers. These saliency maps can then be combined for obfuscating any combination of templates depending on the task at hand.
  • Figure 5: Obfuscation with a Single Attribute and the Impact on Performance. Attribute importance is dataset dependent. For example, notice how the 'Hand' template contributes to a large decrease in action recognition performance on IPN, as the action is determined solely by the hand, whereas on SBU it does not. Optimally blue is high and red is low. Corresponding qualitative examples of saliency maps for each individual template are shown in Fig. 4. Bold text along the abscissa of each plot indicates the templates used for the final results.
  • ...and 3 more figures
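
The three steps described in the Figure 2 caption (template matching, flow-warped noise, saliency-weighted masking) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the cosine-similarity matching, the max aggregation over templates, the nearest-neighbour warp, and the convex blend are all assumptions made for illustration. The per-patch descriptors and the optical flow are taken as given inputs, and the saliency map is assumed to be already resized to the frame resolution.

```python
import numpy as np

def saliency_from_templates(features, templates):
    """Hypothetical matcher: features is (H, W, D), templates is (K, D).
    Returns an (H, W) map in [0, 1]: the maximum cosine similarity over all
    templates, clipped at zero. Max aggregation is an assumption."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    t = templates / (np.linalg.norm(templates, axis=-1, keepdims=True) + 1e-8)
    sim = f @ t.T                      # (H, W, K) cosine similarities
    return np.clip(sim.max(axis=-1), 0.0, 1.0)

def warp_noise(noise, flow):
    """Hypothetical warp: noise is (H, W), flow is (H, W, 2) holding (dx, dy).
    Nearest-neighbour backward warp, out[y, x] = noise[y - dy, x - dx],
    with source coordinates clamped to the image border (a simplification)."""
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return noise[src_y, src_x]

def obfuscate(frame, saliency, noise):
    """Assumed blend: frame is (H, W, C) in [0, 1], saliency and noise are (H, W).
    Salient pixels are replaced by noise, the rest keeps the source frame."""
    w = saliency[..., None]
    return (1.0 - w) * frame + w * noise[..., None]

# Usage on random stand-in data (no real video, features, or flow involved).
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
templates = rng.normal(size=(3, D))
noise = rng.random((H, W))             # plays the role of the initial noise image
for _ in range(4):                     # a short synthetic "video"
    frame = rng.random((H, W, 3))
    features = rng.normal(size=(H, W, D))
    flow = rng.normal(scale=1.0, size=(H, W, 2))
    noise = warp_noise(noise, flow)    # the noise pattern follows the scene motion
    out = obfuscate(frame, saliency_from_templates(features, templates), noise)
```

Because the noise is carried from frame to frame by the flow rather than resampled, the masked regions still move with the scene, which is the property the caption identifies as important for action recognition.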