CEval: A Benchmark for Evaluating Counterfactual Text Generation
Van Bach Nguyen, Jörg Schlötterer, Christin Seifert
TL;DR
CEval addresses the fragmented evaluation of counterfactual text generation with a unified benchmark that combines counterfactual efficacy metrics with text-quality metrics, supported by curated datasets and common baselines. It formalizes the task, defines metrics such as Flip Rate ($FR$) and Probability Change ($\Delta P$), and validates the approach on IMDB and SNLI with both automatic and human annotations. No single method excels on all dimensions: methods with classifier access flip labels effectively but at the cost of text quality, while LLMs and human edits produce higher-quality text but weaker counterfactual performance. Open-source LLMs such as Mistral-7B can reasonably substitute for costlier closed models in evaluation. CEval's open-source Python library and multi-model evaluation framework support robust, reproducible benchmarking of counterfactual text generation, and the results point toward hybrid approaches and improved classifier calibration as future directions.
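To make the two counterfactual metrics named above concrete, the following is a minimal sketch, not CEval's actual implementation. It assumes that Flip Rate is the fraction of instances whose predicted label changes after editing, and that Probability Change is the mean increase in the classifier's probability for the target label. The function names and argument layout are hypothetical and are not taken from the CEval library.

```python
from typing import Sequence


def flip_rate(original_labels: Sequence[int],
              counterfactual_labels: Sequence[int]) -> float:
    """Assumed definition: share of instances whose predicted label
    differs between the original text and its counterfactual."""
    if len(original_labels) != len(counterfactual_labels):
        raise ValueError("label sequences must have the same length")
    if not original_labels:
        raise ValueError("at least one instance is required")
    flips = sum(o != c for o, c in zip(original_labels, counterfactual_labels))
    return flips / len(original_labels)


def probability_change(target_prob_original: Sequence[float],
                       target_prob_counterfactual: Sequence[float]) -> float:
    """Assumed definition: mean difference in the classifier's probability
    of the target label, counterfactual minus original."""
    if len(target_prob_original) != len(target_prob_counterfactual):
        raise ValueError("probability sequences must have the same length")
    if not target_prob_original:
        raise ValueError("at least one instance is required")
    diffs = [c - o for o, c in zip(target_prob_original,
                                   target_prob_counterfactual)]
    return sum(diffs) / len(diffs)


# Hypothetical classifier outputs for four original/counterfactual pairs.
fr = flip_rate([0, 1, 1, 0], [1, 1, 0, 1])
dp = probability_change([0.1, 0.7, 0.2, 0.3], [0.9, 0.6, 0.8, 0.7])
print(f"FR = {fr:.2f}, Delta P = {dp:.2f}")
```

Under these assumed definitions, a method can score a high $FR$ with a small $\Delta P$ when labels flip only narrowly across the decision boundary, which is one reason to report both.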
Abstract
Counterfactual text generation aims to minimally change a text so that it is classified differently. Judging progress in methods for counterfactual text generation is hindered by the inconsistent use of datasets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text-quality metrics and includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST), and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text, while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.
