Illuminating Blind Spots of Language Models with Targeted Agent-in-the-Loop Synthetic Data

Philip Lippmann, Matthijs T. J. Spaan, Jie Yang

TL;DR

This work proposes a novel approach to blind spot mitigation that uses intelligent agents -- either humans or large LMs -- as teachers to characterize unknown-unknown (UU) errors, leveraging the agents' generalization capabilities.

Abstract

Language models (LMs) have achieved impressive accuracy across a variety of tasks but remain vulnerable to high-confidence misclassifications, also referred to as unknown unknowns (UUs). These UUs cluster into blind spots in the feature space, leading to significant risks in high-stakes applications. This is particularly relevant for smaller, lightweight LMs that are more susceptible to such errors. While the identification of UUs has been extensively studied, their mitigation remains an open challenge, including how to use identified UUs to eliminate unseen blind spots. In this work, we propose a novel approach to address blind spot mitigation through the use of intelligent agents -- either humans or large LMs -- as teachers to characterize UU-type errors. By leveraging the generalization capabilities of intelligent agents, we identify patterns in high-confidence misclassifications and use them to generate targeted synthetic samples to improve model robustness and reduce blind spots. We conduct an extensive evaluation of our method on three classification tasks and demonstrate its effectiveness in reducing the number of UUs, all while maintaining a similar level of accuracy. We find that the effectiveness of human computation has a high ceiling but is highly dependent on familiarity with the underlying task. Moreover, the cost gap between humans and LMs surpasses an order of magnitude, as LMs attain human-like generalization and generation performance while being more scalable.

Paper Structure (35 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: In a sentiment classification task, we begin with a UU resulting from a perturbation -- denoted by a cross in the feature space. This UU is used to generate an initial hypothesis via abstraction, performed through human computation or by an LM. The abstraction hypothesis can then either be used to generate synthetic samples that target the existing blind spot, or be used to generate a new hypothesis via extrapolation, which in turn is used to generate synthetic samples targeting an unseen blind spot.
  • Figure 2: Example of hypothesis generalization using abstraction for the IMDB dataset. The abstraction is performed by a human or LLM based on original and perturbed samples.
  • Figure 3: Workflow: (A) Obtain UUs from the validation set using the original finetuned model; (B) use the UUs to extend the training data via generalization (see the generalization figure) and thus obtain a more robust model; (C) evaluate the retrained model. The adversarial perturbations in the dotted box are optional.
  • Figure 4: Plots of prediction confidence per misclassified sample for BERT on the QNLI dataset when using TF as a perturbation technique, showing the distribution across confidence bins. Retraining alters the distribution of prediction confidences, regardless of how it was performed. Our method lowers the number of high-confidence misclassifications, especially those at the highest confidence levels, improving model calibration.
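
To make the notions of a UU and of the confidence bins in Figure 4 concrete, the following is a minimal sketch of how high-confidence misclassifications might be selected and binned from a classifier's output probabilities. The function names, the 0.9 confidence threshold, and the ten equal-width bins are illustrative assumptions, not values taken from the paper.

```python
def find_unknown_unknowns(probs, labels, threshold=0.9):
    """Return indices of misclassified samples predicted with high confidence.

    probs: per-sample lists of class probabilities.
    labels: gold class indices.
    threshold: assumed cutoff for "high confidence" (hypothetical value).
    """
    uu_indices = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax class
        if pred != y and p[pred] >= threshold:
            uu_indices.append(i)
    return uu_indices


def misclassification_histogram(probs, labels, n_bins=10):
    """Count misclassified samples per equal-width confidence bin over [0, 1]."""
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        if pred != y:
            # A confidence of exactly 1.0 falls into the last bin.
            bin_idx = min(int(p[pred] * n_bins), n_bins - 1)
            counts[bin_idx] += 1
    return counts


# Toy binary example: every gold label is class 0.
probs = [[0.05, 0.95], [0.45, 0.55], [0.9, 0.1], [0.2, 0.8]]
labels = [0, 0, 0, 0]
uus = find_unknown_unknowns(probs, labels)
hist = misclassification_histogram(probs, labels)
```

In this toy example only the first sample counts as a UU under the assumed threshold. Comparing such a histogram before and after retraining is one way to check whether mass has moved away from the highest-confidence bins.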