Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques

Michela Lorandi, Anya Belz

TL;DR

The paper investigates the reproducibility of metric-based evaluations for controllable text generation (CTG) by applying the QRA++ framework to a reproduction of a set of single- and multi-attribute CTG methods. It finds strong reproducibility for Type II (correlation-based) and Type IV (qualitative findings) results, but more variability for Type I (single numerical scores) results, with toxicity-related metrics showing the poorest reproducibility. The work details two reproduction modes and notes that some metrics had to be computed with the reproducers' own scripts, highlighting reporting and methodological gaps that can hinder exact replication. Overall, QRA++ offers a principled way to compare reproducibility across studies and highlights practical considerations when rerunning metric-based CTG evaluations.

Abstract

Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, however, such reruns do not always produce the same results as the original work, and can reveal errors in its reporting.

Paper Structure (12 sections, 6 tables)