Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

Jiuding Sun; Chantal Shaib; Byron C. Wallace

TL;DR

This work interrogates the robustness of instruction-tuned language models to unseen, semantically equivalent instructions. It demonstrates that unobserved instruction phrasings can substantially degrade zero-shot performance and that scaling alone does not fully mitigate this sensitivity. The authors propose a lightweight soft-prompt alignment technique, adding paraphrase data and KL-divergence constraints to encourage consistent representations for equivalent instructions. Across Flan-T5, Alpaca, and T0 families on MMLU and Big-Bench Lite, the approach yields consistent robustness gains, highlighting a practical path to more reliable zero-shot generalization in moderately sized models.

Abstract

Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) How sensitive are instruction-tuned models to the particular phrasings of instructions, and (2) How can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. We propose a simple method to mitigate this issue by introducing "soft prompt" embedding parameters and optimizing these to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.
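To make the alignment idea concrete, the following is a minimal sketch of a KL-divergence term that penalizes disagreement between the output distribution produced under an observed instruction and the one produced under a paraphrase of it. It is not the authors' implementation: the function names, the direction of the divergence, the choice to compare output distributions, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(observed_logits, paraphrase_logits):
    """Hypothetical alignment term: KL between the distribution under an
    observed instruction and the distribution under an unobserved paraphrase.
    The direction KL(observed || paraphrase) is an assumption of this sketch."""
    p = softmax(observed_logits)
    q = softmax(paraphrase_logits)
    return kl_divergence(p, q)


# Toy logits over three answer options (made-up numbers, not results).
same = alignment_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = alignment_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In a training setup of this kind, such a term would be minimized with respect to the soft-prompt parameters only, so that semantically equivalent instructions produce similar predictions while the base model stays frozen; that detail is likewise an assumption of this sketch.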

Paper Structure (582 sections, 14 figures, 19 tables)

Introduction
Related Work
Multitask learning and instruction-tuning
Evaluating prompting and instruction capabilities
Improving instruction-tuning
Instruction Datasets
Evaluation Benchmarks
Collecting New Instructions from NLP Researchers
Evaluating the Robustness of Instruction-tuned LLMs
Models and Data
Results
A Closer Look at Instruction Robustness
Scaling
Robustness with Semantic Distance
Robustness Under In-Context Learning (ICL)
...and 567 more sections

Figures (14)

Figure 1: How well do models trained on instruction-tuning datasets generalize to novel instructions (unobserved in training)? Our analysis suggests that they do not do so very well. Above we show a case where pairing an example with an observed instruction yields the correct output, while providing a distinct but semantically equivalent instruction produces an incorrect response. We propose and evaluate a simple method that improves this.
Figure 2: Using novel but valid instructions at test time (phrasings unobserved in training) consistently degrades the performance of instruction-tuned LLMs (a). Scale does not necessarily fix this (b).
Figure 3: Incorrect but observed instructions perform better on average than correct but unobserved instructions. We report averages over benchmarks, but show example instructions on the right for a specific, illustrative task. We provide all instructions in the Appendix.
Figure 4: t-SNE plots of representations for the first decoded tokens of 300 randomly sampled examples from MMLU and BBL with Flan-T5 (XXL). Embeddings of observed and unobserved instructions for MMLU are similar, while for BBL they are quite different. This result holds across most but not all models considered; see the embeddings section for visualizations over all models.
Figure 5: Plots of average degradations in performance versus the semantic distance while using unobserved instructions.
...and 9 more figures
