CEval: A Benchmark for Evaluating Counterfactual Text Generation

Van Bach Nguyen; Jörg Schlötterer; Christin Seifert

TL;DR

CEval addresses the fragmented evaluation of counterfactual text generation with a unified benchmark that combines counterfactual efficacy metrics with text-quality assessment, supported by curated datasets and common baselines. It formalizes the task, defines metrics such as Flip Rate ($FR$) and Probability Change ($\Delta P$), and validates the approach on IMDB and SNLI with both automatic and human annotations. No single method excels across all dimensions: methods with classifier access flip labels effectively but at the cost of text quality, while LLMs and human edits produce higher-quality text but weaker counterfactual performance. Open-source LLMs such as Mistral-7B can reasonably substitute for costlier closed models when evaluating text quality. CEval is released as an open-source Python library with a multi-model evaluation framework, supporting reproducible benchmarking and pointing toward hybrid approaches and improved classifier calibration.
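The two counterfactual metrics named above can be illustrated with a minimal sketch. It assumes one common reading of them: $FR$ as the fraction of counterfactuals that the classifier assigns to the target label, and $\Delta P$ as the mean increase in target-label probability from the original text to its counterfactual. The function names, argument layout, and sign convention are hypothetical and are not taken from the CEval library; the paper's exact definitions may differ.

```python
from typing import Sequence


def flip_rate(predicted_labels: Sequence[int], target_labels: Sequence[int]) -> float:
    """Fraction of counterfactuals classified as their target label (assumed definition)."""
    if len(predicted_labels) != len(target_labels):
        raise ValueError("predicted_labels and target_labels must have the same length")
    if len(target_labels) == 0:
        raise ValueError("at least one counterfactual is required")
    flipped = sum(p == t for p, t in zip(predicted_labels, target_labels))
    return flipped / len(target_labels)


def probability_change(
    p_target_original: Sequence[float],
    p_target_counterfactual: Sequence[float],
) -> float:
    """Mean change in target-label probability, counterfactual minus original (assumed sign)."""
    if len(p_target_original) != len(p_target_counterfactual):
        raise ValueError("both probability sequences must have the same length")
    if len(p_target_original) == 0:
        raise ValueError("at least one text pair is required")
    deltas = [c - o for o, c in zip(p_target_original, p_target_counterfactual)]
    return sum(deltas) / len(deltas)


# Hypothetical classifier outputs for four counterfactuals with target label 1.
fr = flip_rate(predicted_labels=[1, 1, 0, 1], target_labels=[1, 1, 1, 1])
dp = probability_change([0.1, 0.2], [0.9, 0.4])
```

Under this reading, a method can score a high $FR$ while individual probability shifts remain small or large, which is why the two metrics are reported together.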

Abstract

Counterfactual text generation aims to minimally change a text so that it is classified differently. Judging advancements in method development for counterfactual text generation is hindered by the non-uniform use of datasets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics, and includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST), and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text, while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.

Paper Structure (22 sections, 5 equations, 5 figures, 9 tables)

Introduction
Related Work
Benchmark Design
Metrics
Counterfactual metrics
Text Quality Metrics
Datasets and Classifiers
Counterfactual Methods Selection
Results
There is no single best method.
Diversity and distance are correlated.
Probability changes are mostly bimodal.
Generated texts exhibit substantial differences.
Temperature affects counterfactual generation diversity.
Comparison of LLMs for Text Quality Evaluation
...and 7 more sections

Figures (5)

Figure 1: Examples of counterfactuals generated by different methods and human annotators that successfully flip the label from negative to positive for the same original instance.
Figure 2: Distribution of target label probabilities of all methods on the IMDB dataset, including original text and human groups.
Figure 3: Avg. pairwise Levenshtein distance on IMDB.
Figure 4: Pearson correlation between Mistral and ChatGPT in text quality evaluation with different temperatures (0.2 and 1.0) on the IMDB dataset. The same model at different temperatures shows a strong correlation, whereas different models show a moderate correlation in evaluating text quality for counterfactual generation.
Figure 5: Pearson correlation between Mistral and ChatGPT in text quality evaluation with different temperatures (0.2 and 1.0) on the SNLI dataset. Text quality evaluation results of the same model at different temperatures are strongly correlated; results from different models are moderately correlated.
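
The quantity plotted in Figure 3, the average pairwise Levenshtein distance, can be sketched as follows. This is a minimal illustration that assumes character-level edit distance over a set of counterfactuals generated for the same original text; the paper may instead use token-level or normalized distances, and the function names here are hypothetical rather than part of the CEval library.

```python
from itertools import combinations
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def avg_pairwise_distance(texts: Sequence[str]) -> float:
    """Mean Levenshtein distance over all unordered pairs of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("at least two texts are required")
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)


# Hypothetical counterfactuals generated for one original review.
diversity = avg_pairwise_distance(["a great film", "a good film", "a great movie"])
```

A larger average indicates that the generated counterfactuals differ more from one another, which is the sense in which this distance serves as a diversity measure.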
