Prompting Large Language Models for Counterfactual Generation: An Empirical Study
Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, Tieyun Qian
TL;DR
This paper presents a systematic evaluation framework for prompting large language models (LLMs) to generate counterfactuals across four NLP tasks: sentiment analysis (SA), natural language inference (NLI), named entity recognition (NER), and relation extraction (RE). It combines a causal-theoretic foundation (structural causal models, spurious correlations, interventions) with a three-step LLM-based pipeline (causal word identification, label-controlled infilling, and data augmentation), and studies both prompt design and the small-model backbones used for augmentation (BERT, BART). The key findings are that LLMs produce high-quality counterfactuals in many cases but struggle with complex tasks such as RE, where they are limited by entity constraints and selection bias; that alignment techniques (instruction tuning, RLHF) appear to improve generation quality; and that simply scaling model size does not. LLM-generated counterfactuals can notably boost small-model performance on SA and NLI, but need careful handling on RE and NER to avoid degrading results. For practitioners using LLMs as data-augmentation engines, the study points to alignment and prompt design, rather than parameter count, as the main levers for robust counterfactual generation.
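The three-step pipeline named above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`build_prompt`, `augment`), and the `toy_llm` stand-in are all assumptions made for the example, and a real setup would replace `toy_llm` with a call to an actual model.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def build_prompt(text: str, label: str, target_label: str, guideline: str = "") -> str:
    """Assemble a hypothetical prompt covering steps 1 and 2:
    identify the causal words, then rewrite them to reach the target label."""
    return (
        f"{guideline}\n"
        f"Text: {text}\n"
        f"Current label: {label}\n"
        f"Target label: {target_label}\n"
        "Identify the words that determine the label, then rewrite only "
        "those words so the text matches the target label."
    )


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): swaps one sentiment word."""
    for line in prompt.splitlines():
        if line.startswith("Text: "):
            return line[len("Text: "):].replace("great", "terrible")
    return ""


def augment(
    dataset: List[Example],
    flip: Dict[str, str],
    llm: Callable[[str], str],
    guideline: str = "",
) -> List[Example]:
    """Step 3: add each usable counterfactual to the original training data."""
    augmented = list(dataset)
    for text, label in dataset:
        target = flip[label]
        candidate = llm(build_prompt(text, label, target, guideline)).strip()
        # Discard empty generations and ones that left the text unchanged.
        if candidate and candidate != text:
            augmented.append((candidate, target))
    return augmented


if __name__ == "__main__":
    data = [("The movie was great.", "positive")]
    flip = {"positive": "negative", "negative": "positive"}
    print(augment(data, flip, toy_llm))
```

The augmented list would then be used to train a small model such as BERT or BART. The filtering step here is deliberately crude; it only illustrates that generated counterfactuals need some quality check before they are mixed into training data.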
Abstract
Large language models (LLMs) have made remarkable progress in a wide range of natural language understanding and generation tasks. However, their ability to generate counterfactuals has not been examined systematically. To bridge this gap, we present a comprehensive evaluation framework on various types of NLU tasks, which covers all key factors in determining LLMs' capability of generating counterfactuals. Based on this framework, we 1) investigate the strengths and weaknesses of LLMs as counterfactual generators, and 2) identify the factors that affect LLMs when generating counterfactuals, including both the intrinsic properties of LLMs and prompt design. The results show that, though LLMs are promising in most cases, they face challenges in complex tasks like relation extraction (RE), since they are bounded by task-specific performance, entity constraints, and inherent selection bias. We also find that alignment techniques, e.g., instruction tuning and reinforcement learning from human feedback, may enhance the counterfactual generation ability of LLMs. In contrast, simply increasing the parameter size does not yield the desired improvements. From the perspective of prompt design, task guidelines unsurprisingly play an important role. However, the chain-of-thought approach does not always help, due to inconsistency issues.
