Textual-to-Visual Iterative Self-Verification for Slide Generation
Yunqing Xu, Xinbei Ma, Jiyang Qiu, Hai Zhao
TL;DR
This work tackles automating academic slide generation by decomposing the task into content and layout generation and introducing a textual-to-visual iterative self-verification loop. The method uses a three-part content pipeline (Text Retriever, Figure Extractor, Content Generator) followed by a two-stage layout process (Initial Layout Draft and Textual-to-Visual Refinement) powered by a Reviewer + Refiner cycle. Empirical results show improvements in content alignment, logical flow, visual appeal, and readability over baselines, with GPT-4o delivering strong content performance and the modality-transformed refinement enhancing layout quality. The approach enables more efficient, high-quality slide generation and can be extended to complete multi-page presentations, offering practical value for researchers and educators.
Abstract
Generating presentation slides is a time-consuming task that urgently requires automation. Due to their limited flexibility and lack of automated refinement mechanisms, existing autonomous LLM-based agents face constraints in real-world applicability. We decompose the task of generating missing presentation slides into two key components, content generation and layout generation, aligning with the typical process of creating academic slides. For content generation, we introduce an approach that enhances coherence and relevance by incorporating context from surrounding slides and leveraging section retrieval strategies. For layout generation, we propose a textual-to-visual self-verification process using an LLM-based Reviewer + Refiner workflow, transforming complex textual layouts into intuitive visual formats. This modality transformation simplifies the task, enabling accurate and human-like review and refinement. Experiments show that our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.
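The Reviewer + Refiner workflow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Box` layout format, the slide size, the `self_verify` loop, and the rule-based `toy_render`, `toy_reviewer`, and `toy_refiner` functions are all assumptions made for illustration. In the actual method the Reviewer and Refiner are LLM-based and the rendering is a visual one; here they are replaced by simple geometric stand-ins so the control flow can run on its own.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

SLIDE_W, SLIDE_H = 10.0, 7.5  # assumed slide size (hypothetical units)


@dataclass(frozen=True)
class Box:
    """One element of a textual layout: a named rectangle (hypothetical format)."""
    name: str
    x: float
    y: float
    w: float
    h: float


Layout = List[Box]
Rect = Tuple[str, float, float, float, float]  # name, left, top, right, bottom


def self_verify(
    layout: Layout,
    render: Callable[[Layout], List[Rect]],
    reviewer: Callable[[List[Rect]], List[str]],
    refiner: Callable[[Layout, List[str]], Layout],
    max_rounds: int = 3,
) -> Layout:
    """Render the textual layout, review the rendering, refine, and repeat."""
    for _ in range(max_rounds):
        feedback = reviewer(render(layout))  # review the rendered form, not the text
        if not feedback:                     # reviewer has no complaints: stop early
            break
        layout = refiner(layout, feedback)   # revise the textual layout
    return layout


def toy_render(layout: Layout) -> List[Rect]:
    """Stand-in for the visual rendering step: convert boxes to edge coordinates."""
    return [(b.name, b.x, b.y, b.x + b.w, b.y + b.h) for b in layout]


def toy_reviewer(rects: List[Rect]) -> List[str]:
    """Stand-in for the LLM Reviewer: name every box that leaves the slide."""
    return [
        name
        for name, left, top, right, bottom in rects
        if left < 0 or top < 0 or right > SLIDE_W or bottom > SLIDE_H
    ]


def toy_refiner(layout: Layout, feedback: List[str]) -> Layout:
    """Stand-in for the LLM Refiner: shrink and shift flagged boxes onto the slide."""
    fixed = []
    for b in layout:
        if b.name in feedback:
            w, h = min(b.w, SLIDE_W), min(b.h, SLIDE_H)
            x = min(max(b.x, 0.0), SLIDE_W - w)
            y = min(max(b.y, 0.0), SLIDE_H - h)
            b = replace(b, x=x, y=y, w=w, h=h)
        fixed.append(b)
    return fixed


draft = [Box("title", 0.5, 0.3, 9.0, 1.0), Box("figure", 6.0, 5.0, 5.0, 4.0)]
final = self_verify(draft, toy_render, toy_reviewer, toy_refiner)
```

In this toy run the `figure` box initially extends past the right and bottom edges, so the reviewer flags it and the refiner moves it back inside the slide, while the `title` box is left unchanged. Swapping the three stand-in callables for a real renderer and LLM calls would keep the loop itself the same.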
