People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection

Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil van der Aalst, Claudia Wagner

TL;DR

This work evaluates whether counterfactually augmented data (CADs) can improve robustness in harmful language detection. It systematically compares manual CADs with automated CADs generated by Polyjuice, ChatGPT, and Flan-T5 on sexism and hate speech detection, using both in-domain and diverse out-of-domain test sets. Manual CADs provide the strongest out-of-domain gains and ChatGPT CADs come a close second, while Polyjuice and Flan-T5 CADs often underperform because their edits are frequently insufficient to flip the original label. The authors further analyze CAD properties via an instance-level difficulty metric, pointwise V-information (PVI), and find that edit type, semantic similarity, and generator source significantly influence learnability, underscoring the need for human vetting and hybrid human-AI CAD strategies. For sexism, combining manual and automated CADs (amCAD) yields the best out-of-domain performance among the finetuned models, while manual CADs alone remain the most effective for hate speech. Risks such as bias from identity terms remain, motivating future human-in-the-loop approaches and prompt design that better align automated CADs with human quality standards.

Abstract

NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models are robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence, in this work, we assess whether this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually generated CADs. By testing both model performance on multiple out-of-domain test sets and individual data point efficacy, our results show that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.
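To make the augmentation idea concrete, the following is a minimal sketch of how a CAD can be paired with its original and added to a training set. It is not the authors' pipeline: the `Example` type, the 1/0 label encoding, the `token_similarity` heuristic, and the `min_similarity` threshold are all hypothetical choices made for illustration, and the toy sentence is invented.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Example:
    text: str
    label: int  # hypothetical encoding: 1 = harmful, 0 = not harmful


def make_cad(original: Example, edited_text: str) -> Example:
    """Pair an edited text with the flipped label of its original.

    Assumption: the edit really does flip the label. The abstract notes
    that automated edits often fail to do so, so real use needs vetting.
    """
    return Example(text=edited_text, label=1 - original.label)


def token_similarity(a: str, b: str) -> float:
    """Rough proxy for a 'minimal change': token-level overlap ratio in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def augment(originals, edited_texts, min_similarity=0.5):
    """Return the originals plus one CAD per original whose edit is non-trivial
    but still close to the source text (illustrative threshold)."""
    augmented = list(originals)
    for orig, edited in zip(originals, edited_texts):
        changed = edited != orig.text
        if changed and token_similarity(orig.text, edited) >= min_similarity:
            augmented.append(make_cad(orig, edited))
    return augmented


if __name__ == "__main__":
    originals = [Example("women are terrible drivers", 1)]
    edits = ["some people are terrible drivers"]
    for ex in augment(originals, edits):
        print(ex)
```

The similarity filter only checks that an edit is small; it says nothing about whether the label actually flipped, which is the failure mode the abstract attributes to automated generators.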

Paper Structure (33 sections, 9 figures, 12 tables)


Figures (9)

  • Figure 1: Performance of different RoBERTa models for detecting sexism and hate speech on different test sets, including the macro average over all out-of-domain test sets, called "all OOD+HC" [RQ1]. Note that aCADGPT is a RoBERTa model finetuned on CADs generated by ChatGPT, while FSGPT denotes few-shot classification labels from ChatGPT. Models trained on manual CADs (mCAD) have lower in-domain performance than non-counterfactual models but higher out-of-domain performance on virtually all OOD datasets. Manual CADs are better than automated ones, but CADs from ChatGPT come close to matching their performance. Manual CADs are most effective for hate speech, while for the sexism models, a mix of manual and different types of automated CADs yields the best OOD performance among the finetuned models. (Statistically significant using McNemar's test; see the appendix on statistical significance.) Few-shot labels from ChatGPT (FSGPT) perform well, especially for sexism and for the hate check datasets, but with high variance across datasets.
  • Figure 2: Distribution of PVI scores for the training set containing original data and CADs. For both tasks, CADs from Polyjuice and Flan-T5 have the lowest PVI scores, indicating that they are the hardest to learn.
  • Figure 3: The performance of different types of Flan-T5 models for detecting sexism and hate speech, measured using macro F1. For both sexism and hate speech, models trained on CADs outperform models trained on just original data. For sexism, manual CADs, ChatGPT CADs, and a mixture of CADs perform best, while manual CADs are the best for hate speech.
  • Figure 4: Distribution of OOD test sets' PVI scores for models trained on datasets with and without CADs. For both sexism and hate speech, models trained on original data and CADs have higher average PVI scores (V-information) on OOD test sets compared to models trained on just original data, implying that training on CADs makes the OOD dataset easier to learn.
  • Figure 5: The performance of different types of SVM models for detecting sexism and hate speech, measured using macro F1. For both sexism and hate speech, models trained on CADs outperform models trained on just original data. For sexism, manual CADs, ChatGPT CADs, and a mixture of CADs perform best, while manual CADs are the best for hate speech.
  • ...and 4 more figures
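
The PVI scores in Figures 2 and 4 can be read with the help of a small sketch. It assumes the usual definition of pointwise V-information: the log-probability (base 2) that a model finetuned on the inputs assigns to the gold label, minus the log-probability assigned by a model finetuned with the input replaced by a null (empty) input. The function names and the probabilities below are hypothetical; the paper's own implementation may differ.

```python
import math


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-information in bits for one instance.

    p_with_input: probability of the gold label from a model that sees the text.
    p_null: probability of the gold label from a model trained on empty inputs.
    Higher values mean the instance is easier for the model family to learn;
    negative values mean the input made the gold label less likely.
    """
    if not (0.0 < p_with_input <= 1.0 and 0.0 < p_null <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log2(p_with_input) - math.log2(p_null)


def mean_pvi(prob_pairs) -> float:
    """Average PVI over (p_with_input, p_null) pairs, an estimate of
    V-usable information for the dataset."""
    scores = [pvi(p, q) for p, q in prob_pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical probabilities for three instances, for illustration only.
    pairs = [(1.0, 0.5), (0.5, 0.5), (0.25, 0.5)]
    for p, q in pairs:
        print(p, q, pvi(p, q))
    print("mean:", mean_pvi(pairs))
```

Under this reading, the low scores reported for Polyjuice and Flan-T5 CADs correspond to instances where seeing the text helps the model little, or even hurts, relative to the label prior alone.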