ABFS: Natural Robustness Testing for LLM-based NLP Software

Mingxuan Xiao, Yan Xiao, Shunhui Ji, Yunhe Li, Lei Xue, Pengcheng Zhang

TL;DR

This paper tackles robustness testing for LLM-based NLP software by treating the combined prompt and example as a single input and searching for natural adversarial perturbations. The authors introduce ABFS, which uses Best-First Search to explore the perturbation space efficiently and an adaptive Word Importance Ranking to preserve naturalness, aided by WordNet-based synonym substitutions. Across three datasets and five threat models, ABFS consistently uncovers more robustness failures and produces more natural test cases than baselines, while also transferring well across models and reducing testing cost. The work advances practical reliability assessment for LLM-based NLP applications, particularly in safety-critical domains, by enabling comprehensive, efficient, and transferable robustness evaluation before deployment.
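To make the idea of Word Importance Ranking concrete, here is a minimal sketch of a common leave-one-out formulation: each word is removed in turn, and the drop in the model's confidence for the true label is taken as that word's importance. This is not the paper's adaptive variant, whose details are not given in this summary. The scoring function `toy_confidence`, its weights, and the sample input are hypothetical stand-ins for a real threat model and dataset.

```python
from typing import Callable, List, Sequence, Tuple


def toy_confidence(words: Sequence[str]) -> float:
    """Hypothetical stand-in for a threat model's confidence in the true label."""
    weights = {"great": 0.4, "profit": 0.3, "growth": 0.2}
    return 0.1 + sum(weights.get(w.lower(), 0.0) for w in words)


def rank_word_importance(
    words: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[Tuple[int, float]]:
    """Return (index, confidence drop) pairs, most important word first."""
    base = score(words)
    drops = []
    for i in range(len(words)):
        # Remove word i and measure how much the confidence falls.
        reduced = list(words[:i]) + list(words[i + 1:])
        drops.append((i, base - score(reduced)))
    # sorted() is stable, so ties keep their original left-to-right order.
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


# Prompt and example are ranked together as one token sequence.
text = "Classify the sentiment : great profit and growth this year".split()
ranking = rank_word_importance(text, toy_confidence)
top_words = [text[i] for i, _ in ranking[:3]]
```

With this toy scorer, the words that carry weight in the scoring function rank first, and prompt words that do not affect the score receive an importance of zero. A real threat model would typically assign non-zero importance to prompt words as well, which is why ranking prompt and example jointly matters.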

Abstract

Owing to the exceptional performance of Large Language Models (LLMs) in Natural Language Processing (NLP) tasks, LLM-based NLP software has rapidly gained traction across various domains, such as financial analysis and content moderation. However, these applications frequently exhibit robustness deficiencies, where slight perturbations in input (prompt+example) may lead to erroneous outputs. Current robustness testing methods face two main limitations: (1) low testing effectiveness, limiting the applicability of LLM-based software in safety-critical scenarios, and (2) insufficient naturalness of test cases, reducing the practical value of testing outcomes. To address these issues, this paper proposes ABFS, a straightforward yet effective automated testing method that, for the first time, treats the input prompts and examples as a unified whole for robustness testing. Specifically, ABFS formulates the testing process as a combinatorial optimization problem, employing Best-First Search to identify successful test cases within the perturbation space and designing a novel Adaptive control strategy to enhance test case naturalness. We evaluate the robustness testing performance of ABFS on three datasets across five threat models. On Llama2-13b, the traditional StressTest achieves only a 13.273% success rate, while ABFS attains a success rate of 98.064%, supporting a more comprehensive robustness assessment before software deployment. Compared to baseline methods, ABFS introduces fewer modifications to the original input and consistently generates test cases with superior naturalness. Furthermore, test cases generated by ABFS exhibit stronger transferability and higher testing efficiency, significantly reducing testing costs.
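The search described in the abstract can be illustrated with a small sketch of Best-First Search over synonym substitutions. The sketch keeps a priority queue of candidate inputs ordered by the model's confidence in the true label and always expands the most promising one. It omits the paper's adaptive control strategy. The synonym table `SYNONYMS`, the scorer `toy_confidence`, the `threshold`, and the `max_queries` budget are all hypothetical; a real setup would query WordNet and an actual threat model.

```python
import heapq
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS: Dict[str, List[str]] = {
    "great": ["fine", "large"],
    "profit": ["gain", "earnings"],
    "growth": ["increase", "expansion"],
}


def toy_confidence(words: Sequence[str]) -> float:
    """Hypothetical threat model: confidence in the ground-truth label."""
    weights = {"great": 0.4, "profit": 0.3, "growth": 0.2}
    return 0.1 + sum(weights.get(w, 0.0) for w in words)


def best_first_search(
    words: Sequence[str],
    score: Callable[[Sequence[str]], float],
    synonyms: Dict[str, List[str]],
    threshold: float = 0.5,
    max_queries: int = 200,
) -> Optional[List[str]]:
    """Return a perturbed input scoring below threshold, or None."""
    start = tuple(words)
    start_score = score(start)
    if start_score < threshold:
        return list(start)
    frontier = [(start_score, start)]  # min-heap: lowest confidence first
    seen = {start}
    queries = 1
    # The budget is checked between expansions, so it may be slightly exceeded.
    while frontier and queries < max_queries:
        _, candidate = heapq.heappop(frontier)
        for i, word in enumerate(candidate):
            for synonym in synonyms.get(word, []):
                neighbor = candidate[:i] + (synonym,) + candidate[i + 1:]
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                neighbor_score = score(neighbor)
                queries += 1
                if neighbor_score < threshold:
                    return list(neighbor)  # successful test case found
                heapq.heappush(frontier, (neighbor_score, neighbor))
    return None


original = "Classify the sentiment : great profit and growth".split()
adversarial = best_first_search(original, toy_confidence, SYNONYMS)
```

Unlike a greedy search, which commits to one substitution per step, the priority queue keeps every scored candidate available, so the search can return to an earlier branch if the current one stalls.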

Paper Structure

This paper contains 24 sections, 9 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Slightly perturbed text (green) can mislead ChatGPT into changing its predicted label for a piece of financial news from "POSITIVE" (95% confidence) to "NEGATIVE" (70% confidence).
  • Figure 2: Overview of ABFS.
  • Figure 3: Bipolar adjective structure of WordNet.
  • Figure 4: An example of searching for successful test cases in the transformation space. Black arrows represent synonym replacement operations; blue and orange dashed arrows indicate the paths taken by greedy search and Best-First Search (BFS), respectively. "Score" is the threat model's confidence in the ground-truth label of the input; a search strategy that reaches lower scores is more effective.
  • Figure 5: Test time overhead on different datasets and threat models (lower is better, $\downarrow$).
  • ...and 1 more figure