Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

Abstract

Automated presentation generation remains a challenging task that requires coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment in which LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system that combines structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully the generated slides convey their intended purpose. The inverse specification reward poses an "inverse task" in which an LLM attempts to recover the original specification from the generated slides, and it thereby provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving by 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models. Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm
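
The inverse specification idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes that some LLM has already recovered a specification from the slides, and that both specifications are flat dictionaries of text fields. The similarity measure (token-overlap F1), the field names, and the function names `token_f1` and `inverse_spec_reward` are hypothetical choices made only for illustration.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between two strings (an assumed similarity measure)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def inverse_spec_reward(original_spec: dict, recovered_spec: dict) -> float:
    """Mean per-field similarity between the original brief and the
    specification recovered from the slides (hypothetical scoring rule)."""
    if not original_spec:
        return 0.0
    scores = [
        token_f1(str(value), str(recovered_spec.get(field, "")))
        for field, value in original_spec.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical brief and a partially recovered specification.
original = {"topic": "quarterly sales review", "audience": "executives"}
recovered = {"topic": "sales review"}
reward = inverse_spec_reward(original, recovered)
```

In this toy scoring rule, a field that the recovering model misses entirely contributes zero, so slides that fail to convey part of the brief lower the reward even when the rest is recovered well.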

Paper Structure (43 sections, 14 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Architecture of the proposed system. The LLM agent working in the training loop generates tool calls that are executed in the environment, with multi-component rewards guiding policy optimization.
  • Figure 2: Expert trajectory generation pipeline. The expert LLM generates a tool call each turn, which is executed in the environment. Step rewards are computed as quality deltas after each action, and the conversation history accumulates until the episode terminates.
  • Figure 3: Architecture of the base Qwen2.5-Coder-7B-Instruct model. All 7.62B parameters are frozen and stored in 4-bit quantized format. The model comprises 28 transformer decoder layers, each containing Grouped-Query Attention (28 query heads, 4 KV heads, head dim 128) and a SwiGLU feed-forward network (intermediate dim 18,944). Legend: $\ast$ frozen layers, $\ast$ trainable layers.
  • Figure 4: Architecture of the GRPO-finetuned SlideRL model. LoRA adapters (rank $r = 16$) are injected into all 7 linear projections per layer (Q, K, V, O in attention; gate, up, down in the FFN), adding 1.44M trainable parameters per layer (40.4M total, 0.53% of 7.62B). Base weights remain frozen in 4-bit; only the LoRA matrices (bfloat16) are updated during GRPO training. Legend: $\ast$ frozen layers, $\ast$ trainable layers.
  • Figure 5: Model ranking by overall quality.
  • ...and 6 more figures
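
The trainable-parameter figures quoted in the Figure 4 caption can be checked from the dimensions given in the Figure 3 caption. The sketch below makes three assumptions that the captions do not state: the hidden size equals query heads times head dimension (28 × 128 = 3584), each LoRA adapter on a projection of shape d_in × d_out adds r · (d_in + d_out) parameters, and no bias terms are adapted.

```python
# Values taken from the Figure 3 and Figure 4 captions.
RANK = 16
NUM_LAYERS = 28
NUM_Q_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
FFN_DIM = 18_944
TOTAL_PARAMS = 7.62e9

# Assumption: hidden size = query heads * head dim; KV width = KV heads * head dim.
HIDDEN = NUM_Q_HEADS * HEAD_DIM   # 3584
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 512

# (d_in, d_out) for the 7 adapted linear projections in one decoder layer.
PROJECTIONS = {
    "q": (HIDDEN, HIDDEN),
    "k": (HIDDEN, KV_DIM),
    "v": (HIDDEN, KV_DIM),
    "o": (HIDDEN, HIDDEN),
    "gate": (HIDDEN, FFN_DIM),
    "up": (HIDDEN, FFN_DIM),
    "down": (FFN_DIM, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
fraction = total / TOTAL_PARAMS

print(f"per layer: {per_layer:,}")      # 1,441,792 (about 1.44M)
print(f"total:     {total:,}")          # 40,370,176 (about 40.4M)
print(f"fraction:  {fraction:.2%}")     # 0.53%
```

Under these assumptions the arithmetic reproduces the caption's 1.44M per layer, 40.4M total, and 0.53% of 7.62B.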