Adversarial Robustness of Open-source Text Classification Models and Fine-Tuning Chains

Hao Qin; Mingyang Li; Junjie Wang; Qing Wang

TL;DR

The paper tackles a practical security concern: the adversarial robustness of open-source text classification models and of the fine-tuning chains formed by upstream–downstream relationships on Hugging Face (HF). It conducts a large-scale empirical study over $45{,}688$ HF models, constructs $29{,}148$ fine-tuning chains, and analyzes robustness under six attacks across four datasets. Reported results include an average attack success rate (ASR) of $52.70\%$ and a $12.60\%$ average increase in ASR after fine-tuning for certain architectures. Key findings show pervasive model reuse, architecture-dependent robustness trends (notably degradation for BERT-based chains and improvement for Electra), and high transferability of adversarial vulnerability along chains (average $78.57\%$). The work highlights practical implications for secure model reuse, defense data generation, and robustness-aware deployment, and offers publicly available data to support replication and further study.

Abstract

Context: With the advancement of artificial intelligence (AI) technology and applications, numerous AI models have been developed, leading to the emergence of open-source model hosting platforms like Hugging Face (HF). Thanks to these platforms, individuals can directly download and use models, as well as fine-tune them to construct more domain-specific models. However, just as traditional software supply chains face security risks, AI models and fine-tuning chains also encounter new security risks, such as adversarial attacks. Therefore, the adversarial robustness of these models has garnered attention, potentially influencing people's choices regarding open-source models. Objective: This paper aims to explore the adversarial robustness of open-source AI models and of the chains formed by their upstream-downstream fine-tuning relationships, to provide insights into the potential adversarial risks. Method: We collect text classification models on HF and construct the fine-tuning chains. Then, we conduct an empirical analysis of model reuse and the associated robustness risks under existing adversarial attacks from two aspects, i.e., models and their fine-tuning chains. Results: Despite the models' widespread downloading and reuse, they are generally susceptible to adversarial attacks, with an average attack success rate of 52.70%. Moreover, fine-tuning typically exacerbates this risk, resulting in an average 12.60% increase in attack success rates. We also delve into the influence of factors such as attack techniques, datasets, and model architectures on the success rate, as well as the transitivity along the model chains.
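The two quantities the abstract relies on, a fine-tuning chain and an attack success rate (ASR), can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the chain is recovered by following a hypothetical `upstream_of` mapping from each model to the model it was fine-tuned from, and ASR is computed with one common definition, namely the fraction of originally correctly classified samples whose prediction becomes wrong after the attack. The model names and predictions are made-up toy data, and the paper's exact procedure may differ.

```python
from typing import Dict, List, Optional, Sequence


def build_chain(model: str, upstream_of: Dict[str, Optional[str]]) -> List[str]:
    """Follow upstream links from `model` to its root; return the chain root-first."""
    chain = [model]
    seen = {model}
    parent = upstream_of.get(model)
    # Stop at a root (no upstream) or if a cycle is detected.
    while parent is not None and parent not in seen:
        chain.append(parent)
        seen.add(parent)
        parent = upstream_of.get(parent)
    return chain[::-1]


def attack_success_rate(
    labels: Sequence[int],
    clean_preds: Sequence[int],
    adv_preds: Sequence[int],
) -> float:
    """Assumed ASR definition: flipped samples / originally correct samples."""
    correct = [i for i, (y, p) in enumerate(zip(labels, clean_preds)) if y == p]
    if not correct:
        return 0.0  # nothing to attack, so report zero by convention
    flipped = sum(1 for i in correct if adv_preds[i] != labels[i])
    return flipped / len(correct)


# Hypothetical upstream relationships (illustrative names, not real HF repos).
upstream_of = {
    "bert-base": None,
    "bert-sst2": "bert-base",
    "bert-sst2-domain": "bert-sst2",
}

# Toy labels and predictions for an upstream model and its fine-tuned downstream model.
labels = [1, 0, 1, 1, 0]
upstream_clean = [1, 0, 1, 0, 0]
upstream_adv = [0, 0, 1, 0, 1]
downstream_clean = [1, 0, 1, 1, 0]
downstream_adv = [0, 1, 1, 0, 0]

chain = build_chain("bert-sst2-domain", upstream_of)
asr_up = attack_success_rate(labels, upstream_clean, upstream_adv)
asr_down = attack_success_rate(labels, downstream_clean, downstream_adv)

print("chain:", " -> ".join(chain))
print(f"upstream ASR: {asr_up:.2%}, downstream ASR: {asr_down:.2%}")
print(f"ASR change after fine-tuning: {asr_down - asr_up:+.2%}")
```

In this toy setup the downstream model has a higher ASR than its upstream model, which is the direction of change the abstract describes as typical; the numbers themselves carry no empirical meaning.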

Paper Structure (30 sections, 10 figures, 2 tables)

Introduction
Background
Hugging Face (HF)
Adversarial Attacks And Adversarial Robustness Assessment
Model Collection and Fine-tuning Chain Construction
Model Collection
Fine-tuning Chain Construction
Answering RQ1
Experiment Design
Results And Discussions
RQ1.1: How popular are the open-source text classification models on HF?
RQ1.2: How prevalent is the model reuse within HF?
Answering RQ2
Experiment Design
Chain and Model Selection
...and 15 more sections

Figures (10)

Figure 1: Illustrative examples of fine-tuning chain
Figure 2: The general framework for adversarial attack techniques
Figure 3: The prompt for identifying the upstream model name from the descriptions in "Model Card"
Figure 4: The download distribution of the text classification models on HF (RQ1)
Figure 5: Distribution of model chain lengths (RQ1)
...and 5 more figures
