Perturbed examples reveal invariances shared by language models

Ruchit Rawal; Mariya Toneva

TL;DR

This work addresses the challenge of comparing NLP models beyond IID benchmarks by introducing a shared-invariances framework that uses interpretable perturbations targeting specific linguistic capabilities. It defines goal-conditioned perturbation generation and two invariance-based metrics, Hard-SCoPE and Soft-SCoPE, to quantify how much a target model preserves perturbation invariances established by a reference model. Through experiments across architectures and with black-box APIs (e.g., InstructGPT vs GPT-2), the authors show that larger models tend to share more invariances, while distillation can weaken invariances for certain capabilities. The framework provides insights into how design choices and pretraining influence linguistic capabilities, offering a practical tool for nuanced model evaluation and safer deployment of AI systems. The authors also release code to enable reproducibility and further exploration in this area.

Abstract

The rapid growth in natural language processing (NLP) research has led to numerous new models, outpacing our understanding of how they compare to established ones. One major reason for this difficulty is saturating benchmarks, which may not reflect differences in model performance in the wild well. In this work, we introduce a novel framework to compare two NLP models by revealing their shared invariance to interpretable input perturbations targeting a specific linguistic capability. Via experiments on models from the same and different architecture families, this framework offers insights about how changes in models (e.g., distillation, size increase) affect linguistic capabilities. Furthermore, our framework enables evaluation of invariances between commercial black-box models (e.g., InstructGPT family) and models that are better understood (e.g., GPT-2). Across experiments, we observe that large language models share many invariances encoded by models of various sizes, whereas the invariances encoded by large models are only shared by other large models. Possessing a wide variety of invariances may be key to the recent successes of large language models, and our framework can shed light on the types of invariances retained or emerging in new models. We make the code publicly available.

Paper Structure (40 sections, 5 equations, 19 figures, 3 tables)

Introduction
Related Works
Methodology
Goal Function and Search Method
Transformations and Constraints
Metrics for Quantifying Behavioral-Similarity
Notation:
Performance-based Metrics
Agreement-based Metrics
Proposed Invariance-based Metrics
Hard-SCoPE:
Soft-SCoPE:
Effect of Model Design Choices on Shared-Invariances
Different Linguistic Capabilities
Gap in IID accuracy may overestimate the degree of shared invariances:
...and 25 more sections

Figures (19)

Figure 1: Proposed shared-invariances metrics, Hard-SCoPE and Soft-SCoPE, illustrated for three binary classifiers ($m_1$, $m_2$, and $m_3$). For the perturbation $x \rightarrow x'$, both $m_2$ and $m_3$ satisfy the Hard-SCoPE criterion. However, the effect of the perturbation is more aligned for $m_1$ and $m_3$ than for $m_1$ and $m_2$.
Figure 2: [Reference Model: BERT, Target Model: DistilBERT]. Comparing shared-invariances between DistilBERT and BERT on Synonym-Invariance and Typo-Invariance defined w.r.t. BERT. Distillation hurts some capabilities (Typo-Invariance) substantially more than others (Synonym-Invariance).
Figure 3: [Linguistic-Capability: Synonym-Invariance] Analyzing the effect of size on shared-invariances within the BERT architecture family. The OOD-agreement is higher for target models in similar size ranges as the reference model. However, shared-invariances are higher for target models of larger size irrespective of the reference model.
Figure 4: [Reference Model: GPT-2, Capability: Synonym-Invariance]. Comparing shared-invariances between GPT-2 and various OpenAI models differing in size and finetuning along Synonym-Invariance. Larger InstructGPT models share more invariances with GPT-2. Also, state-of-the-art models finetuned with reinforcement learning (text-davinci-003) share more invariances than their supervised finetuned counterparts (text-davinci-002).
Figure 5: [Dataset: AG’s News, Reference Model: BERT, Target Model: DistilBERT]. Comparing shared-invariances between DistilBERT and BERT on Synonym-Invariance and Typo-Invariance defined w.r.t. BERT trained on the AG’s News dataset. Similar to our observations for SST2 in the main paper, we observe that distillation hurts some capabilities (Typo-Invariance) substantially more than others (Synonym-Invariance).
...and 14 more figures
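
The distinction drawn in the Figure 1 caption (a hard, label-level check versus a softer, graded notion of how closely two models react to the same perturbation) can be illustrated with a minimal sketch. This is not the paper's implementation: the exact Hard-SCoPE and Soft-SCoPE formulas are not reproduced on this page, so the function names, the probability values, and in particular the soft score below are assumptions chosen only to mirror the caption. The hard score is taken here to be the fraction of reference-invariant perturbations on which the target model also keeps its predicted label; the soft score is a hypothetical stand-in that compares how far each model's probability for the reference's predicted class moves.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def is_invariant(p_x, p_xp):
    """True when the predicted label is the same before and after perturbation."""
    return argmax(p_x) == argmax(p_xp)


def hard_shared_invariance(ref_pairs, tgt_pairs):
    """Assumed hard score: among perturbations the reference is invariant to,
    the fraction on which the target is invariant as well."""
    kept = [(r, t) for r, t in zip(ref_pairs, tgt_pairs) if is_invariant(*r)]
    if not kept:
        return float("nan")
    return sum(is_invariant(*t) for _, t in kept) / len(kept)


def soft_shared_invariance(ref_pairs, tgt_pairs):
    """Hypothetical soft score (NOT the paper's Soft-SCoPE): 1 minus the gap
    between the two models' probability shifts on the reference's predicted
    class, clipped at 0 and averaged over reference-invariant perturbations."""
    scores = []
    for (r_x, r_xp), (t_x, t_xp) in zip(ref_pairs, tgt_pairs):
        if not is_invariant(r_x, r_xp):
            continue
        c = argmax(r_x)  # class predicted by the reference on the clean input
        gap = abs((r_xp[c] - r_x[c]) - (t_xp[c] - t_x[c]))
        scores.append(max(0.0, 1.0 - gap))
    return sum(scores) / len(scores) if scores else float("nan")


# Made-up class probabilities for one perturbation x -> x' (binary classifiers).
m1 = [([0.90, 0.10], [0.80, 0.20])]  # reference model
m2 = [([0.60, 0.40], [0.55, 0.45])]  # keeps its label, smaller shift than m1
m3 = [([0.85, 0.15], [0.75, 0.25])]  # keeps its label, shift close to m1's

hard_m2 = hard_shared_invariance(m1, m2)
hard_m3 = hard_shared_invariance(m1, m3)
soft_m2 = soft_shared_invariance(m1, m2)
soft_m3 = soft_shared_invariance(m1, m3)
```

With these made-up numbers both targets pass the hard check, while the soft score ranks the target whose probabilities move like the reference's above the other, which is the qualitative pattern the caption describes.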
