Counterfactual Edits for Generative Evaluation

Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Giorgos Stamou

TL;DR

This work addresses the brittleness of pixel-based evaluation for generative models by introducing a knowledge-driven, black-box framework built on concept-level counterfactual edits. It defines the Concept Set Edit Distance (CSED) and three edit operations (Insertion, Deletion, Replacement) to measure the minimal changes required to transform the concepts of a generated image into its ground-truth conditioning. The framework is applied to two challenging tasks, Story Visualization and Scene Generation, and produces local explanations (edit paths) and global explanations (frequent edit rules) that give interpretable insight into model capabilities and biases. Experiments on CLEVR-SV and COCO-based Scene Generation with state-of-the-art diffusion models show agreement with traditional metrics while offering richer diagnostics of semantic generation. Because the evaluation is model-agnostic, it yields actionable feedback on which concepts a model finds hard to generate or to preserve across frames or scenes, pointing to possible architectural or training improvements.
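
The three edit operations can be illustrated with a minimal sketch. It assumes unit cost for every operation and an arbitrary (alphabetical) pairing of replacements; the knowledge-driven costs that the framework relies on are not reproduced here, and the function names `concept_edits` and `unit_cost_distance` are hypothetical.

```python
from typing import List, Set, Tuple

# (operation, source concept, target concept); an empty string means "none".
Edit = Tuple[str, str, str]


def concept_edits(generated: Set[str], ground_truth: Set[str]) -> List[Edit]:
    """Return one unit-cost edit path turning `generated` into `ground_truth`.

    Assumption: every insertion, deletion, and replacement costs 1, so any
    pairing of leftover concepts gives a minimal-length path.
    """
    extra = sorted(generated - ground_truth)    # generated but not expected
    missing = sorted(ground_truth - generated)  # expected but not generated
    paired = min(len(extra), len(missing))

    edits: List[Edit] = []
    # Pair leftovers as replacements (alphabetical pairing is arbitrary here).
    for src, dst in zip(extra, missing):
        edits.append(("replace", src, dst))
    # Whatever cannot be paired becomes a deletion or an insertion.
    for src in extra[paired:]:
        edits.append(("delete", src, ""))
    for dst in missing[paired:]:
        edits.append(("insert", "", dst))
    return edits


def unit_cost_distance(generated: Set[str], ground_truth: Set[str]) -> int:
    """Number of edits in the path, i.e. the distance under unit costs."""
    return len(concept_edits(generated, ground_truth))


if __name__ == "__main__":
    path = concept_edits({"cat", "sofa", "lamp"}, {"dog", "sofa"})
    print(path)  # [('replace', 'cat', 'dog'), ('delete', 'lamp', '')]
```

With concept-dependent costs, the pairing step would become an assignment problem over the leftover concepts rather than a simple `zip`.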

Abstract

Evaluation of generative models has been an underrepresented field despite the surge of generative architectures. Most recent models are evaluated with rather obsolete metrics that suffer from robustness issues and cannot assess further aspects of visual quality, such as compositionality and logic of synthesis. At the same time, the explainability of generative models remains a limited, though important, research direction, and several current attempts require access to the inner workings of the models. Contrary to prior literature, we view generative models as a black box, and we propose a framework for the evaluation and explanation of synthesized results based on concepts instead of pixels. Our framework exploits knowledge-based counterfactual edits that indicate which objects or attributes should be inserted into, removed from, or replaced in generated images to approach their ground-truth conditioning. Moreover, global explanations produced by accumulating local edits can reveal which concepts a model is unable to generate overall. The application of our framework to various models designed for the challenging tasks of Story Visualization and Scene Synthesis verifies the power of our approach in the model-agnostic setting.
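
The idea of accumulating local edits into a global explanation can be sketched as a simple frequency count. This is an illustrative assumption about the aggregation step, not the authors' procedure; the edit tuple layout and the name `frequent_edits` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (operation, source concept, target concept); an empty string means "none".
Edit = Tuple[str, str, str]


def frequent_edits(
    local_paths: Iterable[List[Edit]], top_k: int = 3
) -> List[Tuple[Edit, int]]:
    """Count how often each edit occurs across all per-image edit paths."""
    counts = Counter(edit for path in local_paths for edit in path)
    return counts.most_common(top_k)


if __name__ == "__main__":
    # Toy per-image edit paths; the concepts are made up for illustration.
    paths = [
        [("insert", "", "dog")],
        [("insert", "", "dog"), ("delete", "lamp", "")],
        [("replace", "cat", "dog")],
    ]
    print(frequent_edits(paths, top_k=1))  # [(('insert', '', 'dog'), 2)]
```

An edit that recurs across many samples, such as a repeated insertion of the same concept, would suggest a concept the model systematically fails to generate.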

Paper Structure

This paper contains 31 sections, 11 equations, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Outline of the proposed framework.
  • Figure 2: Ground truth vs generated CLEVR-SV story frames using impartial for L=4.
  • Figure 3: An image sample generated by Stable Diffusion 2 to extract local explanations.