
Can Adversarial Examples Be Parsed to Reveal Victim Model Information?

Yuguang Yao, Jiancheng Liu, Yifan Gong, Xiaoming Liu, Yanzhi Wang, Xue Lin, Sijia Liu

TL;DR

This work investigates whether adversarial examples can reveal information about the victim model (VM) used to generate them. It introduces the Model Parsing Network (MPN), a supervised framework that predicts four VM attributes from attack instances: architecture type ($AT$), kernel size ($KS$), activation function ($AF$), and weight sparsity ($WS$). MPN can optionally be aided by a Perturbation Estimation Network (PEN) that reconstructs the perturbation $\boldsymbol{\delta}$ from an adversarial example $\mathbf x^\prime$. Across 7 attack types and 135 VM configurations on CIFAR-10/100 and Tiny-ImageNet, the study shows strong in-distribution generalization and demonstrates that the estimated perturbations $\boldsymbol{\delta}_{\mathrm{PEN}}$ lead to higher parsing accuracy than raw adversarial examples, while highlighting challenges in out-of-distribution settings and in transfer scenarios. The results indicate that adversarial examples carry a tangible VM fingerprint, with implications for defense strategies and threat modeling, and gradient-alignment analyses point to a potential connection between model parsing and attack transferability.

Abstract

Numerous adversarial attack methods have been developed to generate imperceptible image perturbations that can cause erroneous predictions of state-of-the-art machine learning (ML) models, in particular, deep neural networks (DNNs). Despite intense research on adversarial attacks, little effort was made to uncover 'arcana' carried in adversarial attacks. In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information (i.e., characteristics of the ML model or DNN used to generate adversarial attacks) from data-specific adversarial instances. We call this 'model parsing of adversarial attacks' - a task to uncover 'arcana' in terms of the concealed VM information in attacks. We approach model parsing via supervised learning, which correctly assigns classes of VM's model attributes (in terms of architecture type, kernel size, activation function, and weight sparsity) to an attack instance generated from this VM. We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models (configured by 5 architecture types, 3 kernel size setups, 3 activation function types, and 3 weight sparsity ratios). We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks if their attack settings are consistent with the training setting (i.e., in-distribution generalization assessment). We also provide extensive experiments to justify the feasibility of VM parsing from adversarial attacks, and the influence of training and evaluation factors in the parsing performance (e.g., generalization challenge raised in out-of-distribution evaluation). We further demonstrate how the proposed MPN can be used to uncover the source VM attributes from transfer attacks, and shed light on a potential connection between model parsing and attack transferability.
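
The figure of 135 victim models follows from the attribute grid described in the abstract (5 × 3 × 3 × 3). The sketch below enumerates that grid and shows how one configuration could be turned into one class label per attribute, which is the kind of target a supervised parser would be trained on. The class names (`arch_0`, `kernel_0`, and so on) and the helper functions are hypothetical placeholders; only the per-attribute class counts come from the abstract.

```python
from itertools import product

# Placeholder class names: the abstract gives only the number of classes per
# attribute (5, 3, 3, 3), so the labels below are hypothetical stand-ins.
ATTRIBUTES = {
    "architecture_type": [f"arch_{i}" for i in range(5)],
    "kernel_size": [f"kernel_{i}" for i in range(3)],
    "activation_function": [f"act_{i}" for i in range(3)],
    "weight_sparsity": [f"sparsity_{i}" for i in range(3)],
}


def enumerate_victim_models(attributes):
    """Return every attribute combination as a dict, one per victim model."""
    names = list(attributes)
    return [dict(zip(names, combo)) for combo in product(*attributes.values())]


def to_class_labels(config, attributes):
    """Map one configuration to per-attribute class indices."""
    return {name: attributes[name].index(value) for name, value in config.items()}


victim_models = enumerate_victim_models(ATTRIBUTES)
print(len(victim_models))  # 5 * 3 * 3 * 3 = 135
```

Treating each attribute as its own classification target keeps the label space small (5 + 3 + 3 + 3 classes) compared with a single 135-way label; whether the paper's MPN is organized this way is not stated in this excerpt.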


Paper Structure

This paper contains 15 sections, 2 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Schematic overview of model parsing from adversarial attacks. (Left) Attack generation leveraging the VM (victim model), with model attributes including architecture type, kernel size, activation function, and weight sparsity. (Middle) Proposed model parsing network (MPN), aiming to classify VM attributes based on adversarial examples. (Right) Demonstrating the efficacy of MPN in accurately parsing model attributes from PGD attacks (Madry et al., 2017) against ResNet9 on CIFAR-10. Performance metrics for MPN are showcased across two distinct types of input: actual adversarial perturbations and estimated adversarial perturbations, detailed in Sec. \ref{sec: Methods}.
  • Figure 2: Model parsing in the context of transfer attacks: An effective model parsing system could accurately identify the original VM from which the adversarial attack was generated, as opposed to merely recognizing the target model intended for the transfer attack.
  • Figure 3: Model parsing via supervised learning. Adversarial examples or perturbations, crafted by attackers, serve as the input of MPN, which aims to decode VM attributes from adversarial inputs. The PEN (perturbation estimation network), introduced subsequently, acts as a preprocessing step, converting adversarial examples into inputs resembling perturbations.
  • Figure 4: The VM attribute classification accuracy of MPN under different input formats (adversarial perturbations $\boldsymbol{\delta}$ vs. examples $\mathbf x^\prime$) and parsing networks (ConvNet-4 vs. MLP). The accuracy is measured in the context of in-distribution generalization. The attack data is generated from the attack methods given in Table \ref{tab: attacks}, with $\ell_\infty$ attack strength $\epsilon = 8/255$ and $\ell_2$ attack strength $\epsilon = 0.5$ on CIFAR-10.
  • Figure 5: Testing accuracies (%) of MPN when trained on adversarial perturbations generated by PGD $\ell_\infty$ at different attack strengths ($\epsilon$) and evaluated at different attack strengths as well. Other setups are consistent with those in Table \ref{tab: in_distribution}.
  • ...and 7 more figures
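
The two input formats compared in Figure 4 are related through the standard additive attack model, $\mathbf x^\prime = \mathbf x + \boldsymbol{\delta}$, which is assumed here rather than quoted from the paper. The following minimal sketch recovers the true perturbation from a clean/adversarial pair and checks it against the $\ell_\infty$ budget $\epsilon = 8/255$ from the caption. The toy pixel values and the function names are illustrative assumptions, and images are represented as flat lists of floats in $[0, 1]$.

```python
EPSILON = 8 / 255  # l_inf attack strength quoted in the Figure 4 caption


def perturbation(x_adv, x_clean):
    """True perturbation under the assumed additive model: delta = x' - x."""
    return [a - c for a, c in zip(x_adv, x_clean)]


def linf_norm(delta):
    """Largest absolute per-pixel change."""
    return max(abs(d) for d in delta)


# Toy three-pixel "image" (hypothetical values, not data from the paper).
x_clean = [0.20, 0.50, 0.90]
x_adv = [0.20 + EPSILON, 0.50 - EPSILON, 0.90]

delta = perturbation(x_adv, x_clean)
# A small tolerance absorbs floating-point rounding in the subtraction.
within_budget = linf_norm(delta) <= EPSILON + 1e-9
print(within_budget)
```

In practice the clean image is not available to a parser, which is the gap the PEN is described as filling by estimating the perturbation from $\mathbf x^\prime$ alone.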