Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

Weizhe Kong, Xiao Wang, Ruichong Gao, Chenglong Li, Yu Zhang, Xing Yang, Yaowei Wang, Jin Tang

TL;DR

This work identifies security vulnerabilities in Pedestrian Attribute Recognition (PAR) and introduces ASL-PAR, an adversarial attack/defense framework built on the CLIP-based PromptPAR pipeline. The core idea combines semantic perturbation and label perturbation to disrupt visual-text alignment and attribute predictions, while constraining perturbations with an $L_{\infty}$ bound to maintain imperceptibility. A defense strategy pairs an input-space noise filter with prompt-level text fine-tuning to mitigate semantic-offset effects in the CLIP space, and a weighted loss combines cross-entropy with a CLIP-guided loss for robust training. Experiments across digital and physical domains on PETA, PA100K, MSP60K, and RAPv2 demonstrate strong attack efficacy and partial cross-dataset transferability, with limitations in cross-model generalization and domain shifts guiding future work.
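As a rough illustration of what an $L_{\infty}$ bound means in practice, the sketch below applies a generic signed-gradient update and then projects the result back into the $L_{\infty}$ ball around the clean image. This is a standard PGD-style step, not the ASL-PAR update itself; the function names, the step size, the budget `eps`, and the random stand-in gradient are all assumptions made for illustration.

```python
import numpy as np

def linf_project(x_adv, x_clean, eps):
    """Keep every pixel of x_adv within eps of x_clean and inside [0, 1]."""
    delta = np.clip(x_adv - x_clean, -eps, eps)
    return np.clip(x_clean + delta, 0.0, 1.0)

def signed_gradient_step(x_adv, x_clean, grad, step, eps):
    """One ascent step along sign(grad), followed by L_inf projection."""
    return linf_project(x_adv + step * np.sign(grad), x_clean, eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_clean = rng.uniform(0.0, 1.0, size=(8, 4, 3))  # toy "image", H x W x C
    eps, step = 8 / 255, 2 / 255                     # hypothetical budget and step
    x_adv = x_clean.copy()
    for _ in range(10):
        # Stand-in for the gradient of an attack loss w.r.t. the input.
        grad = rng.normal(size=x_clean.shape)
        x_adv = signed_gradient_step(x_adv, x_clean, grad, step, eps)
    print("max |perturbation|:", np.abs(x_adv - x_clean).max())
```

In a real attack the gradient would come from backpropagating the attack objective through the PAR model; only the projection step is what enforces imperceptibility.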

Abstract

Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, its potential vulnerability and robustness to interference have not yet been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on pedestrian images, based on a pre-trained CLIP-based PAR framework. The framework first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer fuses the resulting vision and text tokens, and a feed-forward network performs attribute recognition. On top of this PAR framework, we use adversarial semantic and label perturbation to generate the adversarial noise; the resulting attack is termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments in both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains validate the effectiveness of the proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR.
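The patch-embedding step mentioned in the abstract (non-overlapping patches followed by a projection layer) can be sketched roughly as follows. This is a generic ViT-style illustration rather than the paper's code: the helper names `patchify` and `embed_patches`, the 256x128 image size, the patch size of 16, the embedding width of 64, and the random projection weights are all assumptions.

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(patches, weight, bias):
    """Linear projection of flattened patches into token embeddings."""
    return patches @ weight + bias

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.uniform(size=(256, 128, 3))  # assumed pedestrian crop size
    patches = patchify(image, 16)            # (128, 768)
    weight = rng.normal(scale=0.02, size=(patches.shape[1], 64))
    bias = np.zeros(64)
    tokens = embed_patches(patches, weight, bias)
    print(patches.shape, tokens.shape)
```

A patch-level attack would restrict the perturbation to a subset of these patches, while a global attack perturbs the whole image.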

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visualization of Global/Local Adversarial Attack for the Pedestrian Attribute Recognition.
  • Figure 2: Comparison between existing adversarial attackers (a, b) and our newly proposed one (c).
  • Figure 3: An overview of our proposed adversarial attack framework for pedestrian attribute recognition, termed ASL-PAR. Given the pedestrian image and attribute set, the CLIP model is adopted for feature extraction and attribute embedding. The two features are fed into the multi-modal Transformer for attribute recognition. Based on the PAR framework, we design a novel semantic perturbation and label perturbation attack to generate adversarial perturbations.
  • Figure 4: An overview of our proposed defense strategy for pedestrian part semantic adversarial attacks.
  • Figure 5: Global noise visualization results of the proposed attack method and other attack methods on the PETA dataset.
  • ...and 1 more figure