Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback
Ananth Muppidi, Tarak Das, Sambaran Bandyopadhyay, Tripti Shukla, Dharun D A
TL;DR
The paper tackles reference-free evaluation of automatically generated multimodal presentations by introducing REFLEX, a framework that learns to score four content metrics (Coverage, Redundancy, Text-Image Alignment, Flow) and to give actionable explanations for each, using perturbation-based negative samples. It introduces RefSlides, a large cross-domain dataset, and applies contrastive learning with LoRA-fine-tuned Phi3-Mini models to generate explanations and scores without requiring ground-truth presentations at inference time. Automated and human evaluations show that REFLEX outperforms heuristic baselines and state-of-the-art LLM-based evaluators (G-Eval, Phi3-Eval), both in correlation with pseudo-scores and in the quality of explanations, while remaining reference-free. The work demonstrates scalable, interpretable evaluation of presentation content that can guide improvements, with implications for multimodal content evaluation and feedback in generative-AI-assisted slide generation.
Abstract
Automatically generating presentation slides is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of high-quality human-made presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and using them to fine-tune LLMs. This reference-free evaluation technique does not require ground-truth presentations during inference. Our extensive automated and human evaluations demonstrate that our approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.
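To make the idea of metric-specific perturbations concrete, here is a minimal sketch for the Redundancy metric only. It is not the authors' implementation: the function names `perturb_redundancy` and `pseudo_score`, the representation of a deck as a list of slide strings, the duplication strategy, and the linear mapping from perturbation degree to a 1-5 pseudo-score are all assumptions made for illustration.

```python
import random


def perturb_redundancy(slides, degree, seed=0):
    """Build a negative sample by re-inserting copies of existing slides.

    `degree` is the fraction of slides to duplicate (hypothetical knob);
    0.0 returns an unmodified copy of the deck.
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_dup = round(degree * len(slides))
    perturbed = list(slides)
    # Pick distinct slides to duplicate and drop each copy at a random position.
    for slide in rng.sample(slides, n_dup):
        perturbed.insert(rng.randrange(len(perturbed) + 1), slide)
    return perturbed


def pseudo_score(degree, max_score=5):
    """Assumed mapping: heavier perturbation gives a lower score, floor of 1."""
    return max(1, round(max_score - degree * (max_score - 1)))


deck = ["Intro", "Method", "Results", "Conclusion"]
# One (negative sample, pseudo-score) pair per perturbation degree.
samples = [(perturb_redundancy(deck, d), pseudo_score(d)) for d in (0.0, 0.5, 1.0)]
```

Pairs like these, built from a high-quality deck and graded by how heavily it was perturbed, are the kind of supervision a fine-tuned evaluator could learn from without needing a reference presentation at inference time.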
