Assessing the Robustness of LLM-based NLP Software via Automated Testing
Mingxuan Xiao, Yan Xiao, Shunhui Ji, Hanbo Cai, Lei Xue, Pengcheng Zhang
TL;DR
This work tackles the robustness evaluation of LLM-based NLP software by reframing testing as a combinatorial optimization problem over full inputs (prompt+example) using the AORTA framework. It introduces ABS, the first dedicated robustness-testing method for LLM-based systems, which uses adaptive beam search and backtracking to efficiently generate high-quality adversarial test cases. Across three datasets and five open-source LLM threat models, ABS achieves superior robustness-detection performance and reduced computational cost compared with strong baselines, with notable gains in test transferability. The approach advances automated, black-box robustness testing for safety-critical NLP applications and offers a practical path to more reliable LLM-based software prior to deployment.
Abstract
Benefiting from the advancements in LLMs, NLP software has undergone rapid development. Such software is widely employed in various safety-critical tasks, such as financial sentiment analysis, toxic content moderation, and log generation. Unlike traditional software, LLM-based NLP software relies on prompts and examples as inputs. Given the complexity of LLMs and the unpredictability of real-world inputs, quantitatively assessing the robustness of such software is crucial. However, to the best of our knowledge, no automated robustness testing methods have been specifically designed to evaluate the overall inputs of LLM-based NLP software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process as a combinatorial optimization problem. Through AORTA, existing testing methods designed for DNN-based software can be applied to LLM-based software, but their effectiveness is limited. To address this, we propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search (ABS). ABS is tailored for the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability for backtracking. We successfully embed 18 test methods in AORTA and compare the test effectiveness of ABS on three datasets and five threat models. ABS facilitates a more comprehensive and accurate robustness assessment before software deployment, with an average test success rate of 86.138%. Compared to the currently best-performing baseline PWWS, ABS significantly reduces the computational overhead by up to 3441.895 seconds per successful test case and decreases the number of queries by 218.762 times on average. Furthermore, test cases generated by ABS exhibit greater naturalness and transferability.
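
To make the idea of a beam search with an adaptive width and backtracking more concrete, the following is a minimal illustrative sketch and not the authors' ABS implementation. It assumes a black-box setting in which the tester can only query a confidence score. The toy threat model `toy_confidence`, the substitution table `SYNONYMS`, the function `adaptive_beam_search`, the 0.5 decision threshold, and the width-update rule (narrow on progress, widen on stagnation) are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[str, ...]

# Hypothetical word weights (in tenths) for a toy "positive sentiment" model.
WEIGHTS: Dict[str, int] = {"good": 3, "great": 4, "fine": 1, "nice": 2}

# Hypothetical substitution table standing in for a synonym source.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["fine", "decent"],
    "great": ["nice", "big"],
}


def toy_confidence(words: List[str]) -> float:
    """Toy threat model: confidence that the input has the original label."""
    tenths = 2 + sum(WEIGHTS.get(w, 0) for w in words)
    return min(tenths, 10) / 10


def adaptive_beam_search(
    words: List[str],
    confidence: Callable[[List[str]], float],
    synonyms: Dict[str, List[str]],
    threshold: float = 0.5,
    min_width: int = 1,
    max_width: int = 4,
    max_steps: int = 10,
) -> Tuple[Optional[List[str]], int]:
    """Search for a substituted input whose confidence drops below threshold.

    Returns (test case or None, number of model queries).
    """
    start: Candidate = tuple(words)
    start_conf = confidence(list(start))
    queries = 1
    if start_conf < threshold:
        return list(start), queries

    beam: List[Tuple[float, Candidate]] = [(start_conf, start)]
    pruned: List[Tuple[float, Candidate]] = []  # kept for backtracking
    seen = {start}
    width = min_width

    for _ in range(max_steps):
        candidates: List[Tuple[float, Candidate]] = []
        for _, cand in beam:
            for i, word in enumerate(cand):
                for sub in synonyms.get(word, []):
                    new = cand[:i] + (sub,) + cand[i + 1:]
                    if new in seen:
                        continue
                    seen.add(new)
                    conf = confidence(list(new))
                    queries += 1
                    if conf < threshold:
                        return list(new), queries
                    candidates.append((conf, new))

        if not candidates:
            # Dead end: backtrack to the best previously pruned candidates.
            if not pruned:
                return None, queries
            pruned.sort()
            beam, pruned = pruned[:width], pruned[width:]
            continue

        candidates.sort()
        if candidates[0][0] < beam[0][0]:
            width = max(min_width, width - 1)  # progress: narrow the beam
        else:
            width = min(max_width, width + 1)  # stagnation: widen the beam
        pruned.extend(candidates[width:])
        beam = candidates[:width]

    return None, queries


test_case, n_queries = adaptive_beam_search(
    ["a", "good", "and", "great", "movie"], toy_confidence, SYNONYMS
)
print(test_case, n_queries)
```

In this sketch the search stops as soon as any candidate pushes the queried confidence below the threshold, so the query count doubles as a rough cost measure in the spirit of the per-test-case overhead reported above.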
