Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

Jong Inn Park, Maanas Taneja, Qianwen Wang, Dongyeop Kang

TL;DR

This work tackles the challenge of producing accurate, engaging short-form scientific videos by grounding outputs in source papers and figures. It introduces SciTalk, a creator-inspired, multi-agent framework that orchestrates preprocessing, planning, editing, and a feedback-evaluation loop to iteratively refine video generation. By employing specialized agents and vision-language feedback, SciTalk produces more scientifically accurate and coherent videos than single-agent baselines, though it still lags behind expert human creators in overall quality. The study provides early empirical insights into the benefits and challenges of feedback-driven, agent-based video generation for scientific dissemination, and the authors state that code, data, and generated videos will be made publicly available.

Abstract

Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework that grounds videos in multiple sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing. It also incorporates an iterative feedback mechanism in which video agents simulate user roles, give feedback on the videos generated in previous iterations, and refine the generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content as the generation loop is iteratively refined. Although these preliminary results do not yet match human creators' quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.

Paper Structure

This paper contains 19 sections, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: Conceptual overview of the multi-agent video generation pipeline. The pipeline comprises four distinct stages: (1) Preprocessing Stage, where papers are scraped to extract raw text, figures, images, and screenshots; (2) Planning Stage, in which specialized assistants generate the detailed outputs (audio, flash-talk, scene-plan, avatar, background, etc.) that are composited into a video clip; (3) Editing Stage, which integrates visual effects, image layouts, and text components; and (4) Feedback & Evaluation Stage, which assesses video components, carries the suggested improvements into the next iteration's prompts, and evaluates the overall quality of the final video output. Notation: $\text{Prompt}_{i,j}$ denotes the prompt used by agent $i$ during the $j$-th iteration, where $i \in \{F, S, B, T, E, L\}$ corresponds to each generation agent.
  • Figure 2: Detailed workflow on how generation agents contribute to scene composition. Agents operate sequentially across four stages, producing parameters that are passed to the video composition module for final assembly.
  • Figure 3: Improvements on feedback metrics and prompts.
  • Figure 4: Average evaluation scores across iterations for both human and model evaluations. Shaded regions represent 95% confidence intervals. All score axes are standardized to a range between 1.75 and 4.75 for consistency across metrics; higher scores indicate better performance. The blue dashed line represents the model-only average score from a single-agent baseline.
  • Figure 5: Comparison of evaluation scores across three papers (Query, Context, and Speed) between SciTalk-generated and human Creator videos. Top: direct comparison of the 1st SciTalk iteration against Creator videos. Bottom: comparison of the mean score across the five iterations against Creator videos. Black error bars denote 95% confidence intervals.
  • ...and 1 more figure
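
The $\text{Prompt}_{i,j}$ notation in the Figure 1 caption can be read as a table of prompts indexed by agent and iteration. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the function names `refine_prompts`, `fake_generate`, and `fake_feedback`, the stubbed generator and feedback callables, and the rule of appending feedback text to the previous prompt are all assumptions made for illustration. Only the agent keys F, S, B, T, E, L come from the caption, and they are treated here as opaque labels.

```python
from typing import Callable, Dict, List

# Agent keys taken from the Figure 1 notation; treated as opaque labels.
AGENTS = ["F", "S", "B", "T", "E", "L"]


def refine_prompts(
    initial_prompts: Dict[str, str],
    generate: Callable[[Dict[str, str]], str],
    give_feedback: Callable[[str, str], str],
    iterations: int,
) -> List[Dict[str, str]]:
    """Return one prompt table per iteration.

    history[j - 1][i] plays the role of Prompt_{i,j} (lists are 0-based).
    `generate` and `give_feedback` are hypothetical stand-ins for the
    video composition step and the feedback agents.
    """
    history = [dict(initial_prompts)]
    for _ in range(iterations - 1):
        current = history[-1]
        video = generate(current)  # build a video from the current prompts
        next_prompts = {}
        for agent in AGENTS:
            feedback = give_feedback(agent, video)
            if feedback:
                # Assumed update rule: append feedback to the previous prompt.
                next_prompts[agent] = current[agent] + "\n[feedback] " + feedback
            else:
                next_prompts[agent] = current[agent]
        history.append(next_prompts)
    return history


# Stub callables so the sketch runs without any model or video backend.
def fake_generate(prompts: Dict[str, str]) -> str:
    return "video(" + ",".join(sorted(prompts)) + ")"


def fake_feedback(agent: str, video: str) -> str:
    # Pretend only two agents receive feedback in each round.
    return f"improve {agent}" if agent in ("S", "L") else ""


if __name__ == "__main__":
    start = {agent: f"base-{agent}" for agent in AGENTS}
    for j, table in enumerate(refine_prompts(start, fake_generate, fake_feedback, 3), 1):
        print(j, repr(table["S"]))
```

Keeping every iteration's prompt table, rather than overwriting it, makes it possible to compare how a given agent's prompt changed between iterations, which is the kind of comparison Figures 3 and 4 report.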