A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Samuel Ackerman, Ella Rabinovich, Eitan Farchi, Ateret Anaby-Tavor

TL;DR

A novel metric for assessing a model's robustness is proposed, and its benefits in the non-adversarial scenario are demonstrated through empirical evaluation of several models on the created datasets.

Abstract

We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of a model's answers to meaning-preserving variants of its input. Benchmark datasets are constructed by introducing naturally occurring, non-malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel metric for assessing a model's robustness, and demonstrate its benefits in the non-adversarial scenario through empirical evaluation of several models on the created datasets.

Paper Structure

This paper contains 22 sections, 2 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Comparison of normalized Cohen's $h$ ($\tilde{\textrm{H}}$) and reverse PDR ($= -1 \times \textrm{PDR}$) when the original instance accuracy is $score_i^o = 1.0$ (as in the tasks in our study, where the evaluation outcome is binary: 0 or 1).
  • Figure 2: Mean model accuracy on original datasets vs its undirectional robustness. x-axis: the higher, the better performing; y-axis: the lower, the more robust.
  • Figure 3: Mean metric scores by model and dataset. Red error bars show a 95% bootstrapped confidence interval.
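
The comparison described in the Figure 1 caption can be illustrated with a minimal sketch. It rests on assumptions that this summary does not confirm: Cohen's $h$ is taken in its standard form, $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$; the normalization is assumed to divide by $\pi$ so that the value lies in $[-1, 1]$; PDR is assumed to be the usual performance drop rate, $1 - score^p / score^o$; and the sign convention (perturbed minus original) is chosen so that a drop in accuracy gives a negative value. The function names are hypothetical and are not taken from the paper.

```python
import math


def cohens_h(p_perturbed: float, p_original: float) -> float:
    """Standard Cohen's h between two proportions (perturbed minus original)."""
    return 2 * math.asin(math.sqrt(p_perturbed)) - 2 * math.asin(math.sqrt(p_original))


def normalized_h(p_perturbed: float, p_original: float) -> float:
    """Cohen's h rescaled to [-1, 1]; dividing by pi is an assumption here."""
    return cohens_h(p_perturbed, p_original) / math.pi


def reverse_pdr(p_perturbed: float, p_original: float) -> float:
    """Reverse PDR = -1 * PDR, with PDR assumed to be 1 - perturbed / original."""
    if p_original == 0:
        raise ValueError("PDR is undefined when the original score is 0")
    return p_perturbed / p_original - 1.0


if __name__ == "__main__":
    # Original instance accuracy fixed at 1.0, as in the Figure 1 setting.
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"perturbed={p:.2f}  "
              f"normalized_h={normalized_h(p, 1.0):+.3f}  "
              f"reverse_pdr={reverse_pdr(p, 1.0):+.3f}")
```

Under these assumptions both quantities range from $-1$ (every perturbed variant fails) to $0$ (no degradation) when the original accuracy is 1.0, but they follow different curves in between: reverse PDR is linear in the perturbed accuracy, whereas the arcsine transform makes Cohen's $h$ more sensitive to changes near 0 and 1.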