Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei

TL;DR

This paper studies how to make LLMs robust to perturbations of task-level instructions, focusing on character- and word-level edits in classification tasks. It compares multiple robustness strategies, notably iterative self-denoising (SDi) including supervised fine-tuned variants (SFT-SDi), against perplexity smoothing, instruction ensembling, and representation alignment, using Llama 3 and Flan-T5 across CoLA, QNLI, and SST-2. The key finding is that iterative self-denoising, especially SFT-SDi, yields the largest average improvements in the robustness metric ($PDR$), while perplexity smoothing harms performance and other methods provide moderate gains. The results suggest that letting the model self-correct perturbed instructions is a practical and effective approach, with implications for improving instruction-tuned systems and guiding future work toward larger models and more diverse perturbations.

Abstract

Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.

Paper Structure

This paper contains 26 sections, 2 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Example perturbations of an instruction for sentiment classification, shown in (a). The perturbation can be at the character level, as shown in (b), or at the word level, as shown in (c).
  • Figure 2: PDR and semantic similarity for TextFooler and DeepWordBug, averaged across models, datasets and instruction variants. For semantic similarity, we use the cosine similarity between the 4096-dimensional sentence embeddings encoded by E5 Mistral (Wang et al., 2024). We choose this model since, at the time of writing, it achieves leading performance on the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023), which is designed to evaluate the quality of text embeddings on a variety of tasks, including semantic similarity and text classification.
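
The two quantities plotted in Figure 2 can be illustrated with a minimal sketch. The cosine similarity function follows the standard definition and works on any pair of equal-length embedding vectors; the toy vectors in the usage lines stand in for the 4096-dimensional embeddings, which are not computed here. The PDR function is an assumption: this summary does not define the metric, so the sketch takes it to be a performance drop rate, the relative drop in a task score when the clean instruction is replaced by a perturbed one. The function names and example scores are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def performance_drop_rate(clean_score: float, perturbed_score: float) -> float:
    """ASSUMED definition of PDR: relative drop from clean to perturbed score.

    0.0 means the perturbation had no effect; 1.0 means the score fell to zero.
    """
    if clean_score <= 0.0:
        raise ValueError("clean_score must be positive")
    return 1.0 - perturbed_score / clean_score


# Toy usage with made-up numbers (not results from the paper).
similarity = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
drop = performance_drop_rate(clean_score=0.8, perturbed_score=0.6)
```

Under this assumed definition, a lower PDR indicates a more robust model, and a perturbation that keeps cosine similarity high while producing a large PDR is one that changes the instruction little in meaning but hurts performance substantially.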