Are Large Language Models Really Robust to Word-Level Perturbations?

Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, Dacheng Tao

TL;DR

The paper introduces TREvaL, a reward-model-based framework for evaluating the robustness of large language models to word-level perturbations on open-ended prompts. By perturbing 1k Natural Questions prompts and scoring the generated content with pre-trained reward and cost models, TREvaL measures robustness via drop rates rather than simple accuracy. Empirical results reveal widespread vulnerability to misspellings, swaps, and synonyms, with robustness often deteriorating as models undergo fine-tuning (SFT/RLHF), suggesting trade-offs between performance and stability. The authors release TREvaL code and datasets and argue for robustness-aware training paradigms to improve resilience in future LLM generations.

Abstract

The swift advancement in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. Beyond the pursuit of better performance and the avoidance of violent feedback on particular prompts, responsible use of LLMs has drawn much attention to their robustness. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversations that LLMs generate from more challenging open questions, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations reveal how well a language model understands a question, a capability that is not fully captured by individual words or letters, which can be oversimplified and carry inherent biases. Our extensive empirical experiments demonstrate that TREvaL provides an innovative method for evaluating the robustness of an LLM. Furthermore, our results demonstrate that LLMs are frequently vulnerable to word-level perturbations that are commonplace in daily language usage. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREvaL is available at https://github.com/Harry-mic/TREvaL.
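The evaluation loop described above (perturb a prompt, generate a response, score it with a reward model, compare against the clean score) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the perturbation rules, the choice to score the perturbed response against the clean prompt, and the drop-rate formula (relative decrease in mean reward) are all assumptions made for illustration, and the toy generator and scorer merely stand in for an LLM and a pre-trained reward model. Synonym substitution is omitted because it needs a lexical resource.

```python
import random
from statistics import mean
from typing import Callable, List

Perturb = Callable[[str, float, random.Random], str]


def misspell(prompt: str, rate: float, rng: random.Random) -> str:
    """Hypothetical misspelling: swap two adjacent inner characters in a share of words."""
    words = prompt.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(word) - 2)  # keep first and last characters
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


def swap_words(prompt: str, rate: float, rng: random.Random) -> str:
    """Hypothetical swapping: exchange neighbouring words with probability `rate`."""
    words = prompt.split()
    for i in range(len(words) - 1):
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def drop_rate(clean_scores: List[float], perturbed_scores: List[float]) -> float:
    """Assumed definition: relative decrease of the mean reward under perturbation."""
    clean_avg = mean(clean_scores)
    if clean_avg == 0:
        raise ValueError("mean clean score is zero; relative drop is undefined")
    return (clean_avg - mean(perturbed_scores)) / abs(clean_avg)


def evaluate(
    prompts: List[str],
    perturb: Perturb,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    rate: float,
    seed: int = 0,
) -> float:
    """Score responses to clean and perturbed prompts, then return the drop rate."""
    rng = random.Random(seed)
    clean, attacked = [], []
    for prompt in prompts:
        clean.append(score(prompt, generate(prompt)))
        # Assumption: the response is judged against the clean (intended) question.
        attacked.append(score(prompt, generate(perturb(prompt, rate, rng))))
    return drop_rate(clean, attacked)


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM: echoes the prompt."""
    return prompt


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model: share of prompt words found in the response."""
    words = prompt.split()
    response_words = set(response.split())
    return sum(w in response_words for w in words) / len(words)


if __name__ == "__main__":
    questions = ["what is the capital of france", "who wrote the origin of species"]
    for level in (0.0, 0.5, 1.0):
        print(level, evaluate(questions, misspell, toy_generate, toy_score, level))
```

In practice `generate` would wrap the model under test and `score` a reward or cost model; passing them in as callables keeps the comparison logic independent of any particular model family.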

Paper Structure (31 sections, 10 figures, 6 tables)

This paper contains 31 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The primary workflow of TREvaL during a single evaluation round. Clean prompts undergo several types of perturbation, and the clean and perturbed versions are assessed side by side. The evaluation results indicate that Large Language Models lack robustness when confronted with word-level perturbations.
  • Figure 2: The impact of training stages on the robustness of the Beaver family. As the level of perturbation intensifies, the rate of score decline for the three LLMs within the family markedly escalates. Furthermore, at a given level of perturbation, advancing through the stages makes the LLMs less stable, most notably during the RLHF stage. This underscores the critical need to enhance model robustness, particularly in the RLHF stage.
  • Figure 3: The Reward Distribution of Llama2-chat-7B after misspelling perturbation. As the attack intensity gradually increases, we observe a widening disparity between the distributions of attack_rewards and clean_rewards. These distributions progressively skew towards lower values. Moreover, the frequency of high-quality responses diminishes, with the counts within different intervals gradually converging toward a mean value.
  • Figure 4: The landscape of different stages of Beaver-7B. It becomes increasingly clear that the robustness of Large Language Models deteriorates as the fine-tuning process advances. This finding is consistent with the conclusions from our robustness evaluations, indicating that while fine-tuning improves the model's performance, it concurrently compromises its robustness.
  • Figure 5: Beaver-7B Reward Distribution (Misspelling, Swapping, Synonym)
  • ...and 5 more figures