Prompting Large Language Models for Counterfactual Generation: An Empirical Study

Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, Tieyun Qian

TL;DR

This paper presents the first systematic evaluation framework for prompting large language models (LLMs) to generate counterfactuals across four core NLP tasks: sentiment analysis (SA), natural language inference (NLI), named entity recognition (NER), and relation extraction (RE). It combines a causal-theoretic foundation (structural causal models, spurious correlations, interventions) with a three-step LLM-based pipeline (causal word identification, label-controlled infilling, and data augmentation), and it studies both prompt design and the small-model backbones (BERT, BART) trained on the augmented data. The key findings are that LLMs produce high-quality counterfactuals in many cases but struggle with complex tasks such as RE, where they are limited by task-specific performance, entity constraints, and selection bias; that alignment techniques (instruction tuning, RLHF) appear to improve counterfactual generation; and that simply scaling model size does not. LLM-generated counterfactuals can notably boost small-model performance on SA and NLI, but they need careful handling on NER and RE to avoid degrading results. For practitioners using LLMs as data-augmentation engines, the study points to alignment and prompt design, rather than model size, as the main levers for robust counterfactual generation.

Abstract

Large language models (LLMs) have made remarkable progress in a wide range of natural language understanding (NLU) and generation tasks. However, their ability to generate counterfactuals has not been examined systematically. To bridge this gap, we present a comprehensive evaluation framework on various types of NLU tasks, which covers all key factors in determining LLMs' capability of generating counterfactuals. Based on this framework, we 1) investigate the strengths and weaknesses of LLMs as counterfactual generators, and 2) identify the factors that affect LLMs when generating counterfactuals, including both the intrinsic properties of LLMs and prompt design. The results show that, although LLMs are promising in most cases, they face challenges in complex tasks such as relation extraction (RE), since they are bounded by task-specific performance, entity constraints, and inherent selection bias. We also find that alignment techniques, e.g., instruction tuning and reinforcement learning from human feedback, may enhance the counterfactual generation ability of LLMs. In contrast, simply increasing the parameter size does not yield the desired improvements. In addition, from the perspective of prompt design, task guidelines unsurprisingly play an important role. However, the chain-of-thought approach does not always help, due to inconsistency issues.
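
As a rough illustration of the pipeline summarized in the TL;DR (causal word identification, label-controlled infilling, then data augmentation), the following minimal Python sketch shows how such a loop might be wired up for sentiment analysis. Everything in it is an assumption made for illustration: the `Sample` type, the prompt wording in `build_prompt`, the `augment` helper, and the `toy_generator` stand-in for an LLM call are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """A labeled sentence (hypothetical container, not from the paper)."""
    text: str
    label: str


def build_prompt(sample: Sample, target_label: str) -> str:
    """Build an illustrative prompt; the wording is an assumption."""
    return (
        "Task: generate a counterfactual for sentiment analysis.\n"
        "Step 1: identify the words that cause the original label.\n"
        "Step 2: replace them so the label becomes the target label, "
        "changing as little else as possible.\n"
        f"Sentence: {sample.text}\n"
        f"Original label: {sample.label}\n"
        f"Target label: {target_label}\n"
        "Counterfactual sentence:"
    )


def augment(
    samples: List[Sample],
    flip: Dict[str, str],
    generate: Callable[[str], str],
) -> List[Sample]:
    """Step 3: return the originals plus one counterfactual per sample.

    `generate` is any function mapping a prompt to text (for example a
    wrapper around an LLM). Outputs that are empty or identical to the
    original sentence are discarded.
    """
    augmented = list(samples)
    for sample in samples:
        target = flip[sample.label]
        cf_text = generate(build_prompt(sample, target)).strip()
        if cf_text and cf_text != sample.text:
            augmented.append(Sample(cf_text, target))
    return augmented


def toy_generator(prompt: str) -> str:
    """Stand-in for an LLM: swaps one sentiment word in the sentence."""
    prefix = "Sentence: "
    line = next(l for l in prompt.splitlines() if l.startswith(prefix))
    return line[len(prefix):].replace("great", "terrible")


FLIP = {"positive": "negative", "negative": "positive"}
train = [Sample("The film was great.", "positive")]
augmented_train = augment(train, FLIP, toy_generator)
```

In this sketch `augmented_train` holds the original sample followed by its label-flipped counterpart, and it could then be used to fine-tune a small model such as a BERT- or BART-based classifier. A real setup would replace `toy_generator` with an actual LLM call and would likely need task-specific filtering, especially for NER and RE, where the study reports that unreasonable counterfactuals can hurt performance.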

Paper Structure (48 sections, 2 equations, 8 figures, 26 tables)

This paper contains 48 sections, 2 equations, 8 figures, 26 tables.

Figures (8)

  • Figure 1: (a) Structural causal model of the SA task, (b) Intervention operation.
  • Figure 2: Left: The proposed framework for evaluating counterfactuals generated by LLMs (SA task). Right: Original (OG) samples and generated counterfactual (CF) samples on SA, NLI, NER and RE tasks.
  • Figure 3: Performance comparison under few-shot settings. The LLMs refer to GPT-3.5. The results of SLMs are obtained by averaging the performance of BERT-based and BART-based fine-tuned models.
  • Figure 4: Task-specific performance (left) of LLMs and augmentation effects on SLMs (right).
  • Figure 5: Reasons that lead to unreasonable counterfactuals and corresponding proportions.
  • ...and 3 more figures