FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau

TL;DR

FLUKE presents a task-agnostic framework to evaluate NLP robustness by generating minimal, linguistically motivated test modifications across orthography, morphology, syntax, semantics, discourse, and language varieties using LLM prompts with human validation. The framework is demonstrated on six tasks (four classification, two generation) and a range of models from PLMs to reasoning LLMs, measuring robustness with the unrobustness score $U = \frac{1}{N} \sum_{i=1}^{N} | m_i - o_i | \cdot 100$. Key findings show that robustness is highly task-dependent, that LLMs are not universally more robust than PLMs, and that linguistically valid modifications often reveal greater brittleness than traditional adversarial perturbations; importantly, a model’s ability to generate a linguistic modification does not predict its downstream robustness. FLUKE thus provides a comprehensive, linguistically grounded, task-agnostic testing paradigm and practical resources (prompts, scripts, dashboard) for model developers to assess and disclose robustness characteristics, potentially informing model cards and deployment decisions.
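
The unrobustness score above can be computed as in the following minimal sketch. This summary does not define $m_i$ and $o_i$; the sketch assumes they are per-instance scores (for example, 1 for a correct prediction and 0 for an incorrect one) on the modified and original versions of test instance $i$. The function name `unrobustness` and the toy data are hypothetical and are not taken from the FLUKE resources.

```python
from typing import Sequence


def unrobustness(original: Sequence[float], modified: Sequence[float]) -> float:
    """Compute U = (1/N) * sum_i |m_i - o_i| * 100.

    Assumption: original[i] (o_i) and modified[i] (m_i) are per-instance
    scores for the same test instance before and after a modification.
    """
    if len(original) != len(modified):
        raise ValueError("original and modified must be paired one-to-one")
    if len(original) == 0:
        raise ValueError("at least one instance pair is required")
    n = len(original)
    # Sum the absolute per-instance score changes, then scale to 0-100.
    total_change = sum(abs(m - o) for o, m in zip(original, modified))
    return 100.0 * total_change / n


# Toy example: 4 paired instances scored 1 (correct) or 0 (incorrect).
original_scores = [1, 1, 0, 1]
modified_scores = [1, 0, 0, 0]
u = unrobustness(original_scores, modified_scores)
robustness = 100.0 - u  # the "100-U" quantity named in the Figure 1 caption
print(u, robustness)
```

Because the absolute value is taken per instance, a change in either direction (correct to incorrect or incorrect to correct) counts toward $U$, so the score reflects instability and not only a drop in accuracy.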

Abstract

We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels -- from orthography to dialect and style -- and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation) than to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate with its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.

Paper Structure

This paper contains 37 sections, 1 equation, 3 figures, 14 tables.

Figures (3)

  • Figure 1: Generation Performance vs. Robustness ($100 - U$) of GPT-4o
  • Figure 2: Example annotation page for the Sentiment Analysis (SA) use case, used to derive linguistic capability tests in Stage 1.
  • Figure 3: Example annotation page for the Sentiment Analysis (SA) use case in Stage 2.