Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation
Jong Inn Park, Maanas Taneja, Qianwen Wang, Dongyeop Kang
TL;DR
This work tackles the challenge of producing accurate, engaging short-form scientific videos by grounding outputs in source papers and figures. It introduces SciTalk, a creator-inspired, multi-agent framework that orchestrates preprocessing, planning, editing, and a feedback-evaluation loop to iteratively refine video generation. By employing specialized agents and vision-language feedback, SciTalk produces more scientifically accurate and coherent videos than single-agent baselines, though it still lags behind expert human creators in overall polish. The study provides early empirical insights into the benefits and challenges of feedback-driven, agent-based video generation for scientific dissemination, and makes code, data, and generated videos publicly available for future development.
Abstract
Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework that grounds videos in multiple sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing. It also incorporates an iterative feedback mechanism in which video agents simulate user roles, give feedback on the videos generated in previous iterations, and refine the generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content across the iterative refinement loop. Although these preliminary results do not yet match human creators' quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.
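The iterative feedback mechanism described above can be illustrated with a minimal sketch. This is not SciTalk's implementation: the names `refine_loop`, `Draft`, `toy_generate`, `toy_reviewer`, and `toy_revise` are hypothetical, the "video" is a plain string standing in for a rendered clip, and the stopping rule (stop when no reviewer has feedback, or after a fixed number of iterations) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """One iteration of the loop: the prompt used, the output, and reviewer notes."""
    prompt: str
    video: str  # placeholder string; a real system would hold a rendered video
    feedback: List[str] = field(default_factory=list)


def refine_loop(
    source_text: str,
    generate: Callable[[str], str],
    reviewers: List[Callable[[str], List[str]]],
    revise: Callable[[str, List[str]], str],
    max_iters: int = 3,
) -> List[Draft]:
    """Generate, collect feedback, revise the prompt, and repeat (hypothetical sketch)."""
    prompt = source_text
    history: List[Draft] = []
    for _ in range(max_iters):
        video = generate(prompt)
        # Each reviewer plays a simulated user role and returns a list of notes.
        notes = [note for review in reviewers for note in review(video)]
        history.append(Draft(prompt, video, notes))
        if not notes:  # assumed stopping rule: no remaining feedback
            break
        prompt = revise(prompt, notes)
    return history


# Toy stand-ins for the LLM-backed agents (illustration only).
def toy_generate(prompt: str) -> str:
    return f"video[{prompt}]"


def toy_reviewer(video: str) -> List[str]:
    # Asks for a figure until the generation prompt mentions one.
    return [] if "show figure" in video else ["show figure"]


def toy_revise(prompt: str, notes: List[str]) -> str:
    return prompt + " | " + "; ".join(notes)


history = refine_loop("summary of paper", toy_generate, [toy_reviewer], toy_revise)
for i, draft in enumerate(history, start=1):
    print(i, draft.video, draft.feedback)
```

In this toy run the reviewer's note is folded into the prompt after the first iteration, so the second draft passes review and the loop stops early. In a real system the callables would wrap LLM and vision-language model calls rather than string operations.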
