PodAgent: A Comprehensive Framework for Podcast Generation

Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee

TL;DR

PodAgent tackles automated podcast-like audio generation by integrating a Host-Guest-Writer multi-agent framework, a voice-role matching system, and LLM-guided instruction-following speech synthesis to produce content-rich, expressive long-form dialogue. It introduces a comprehensive evaluation protocol for open-ended podcast content and voice quality, and demonstrates substantial improvements over direct GPT-4 generation in topic-discussion content and voice alignment, achieving an 87.4% voice-matching accuracy. The approach enables end-to-end production from topic to audio, with validated benefits in dialogue diversity, informativeness, and expressiveness, while acknowledging limitations in voice pool size and acoustic realism. Overall, PodAgent offers a scalable blueprint for automatic podcast creation with practical implications for content automation, voice replication ethics, and AI-assisted media production.

Abstract

Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation and in appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) uses an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.

Paper Structure

This paper contains 23 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An overview of human-made podcasts (above) and the generative-model-based PodAgent (below): LLMs, TTS, and TTA are used to generate conversation scripts, speech, sound effects, and music.
  • Figure 2: The Workflow of Host-Guest-Writer System. Left: The Host-agent generates guest information and an interview outline based on the given topic and number of guests. Middle: Guest-agents respond to the interview outlines, offering specialized perspectives aligned with their assigned roles. Right: The Writer-agent compiles a complete and coherent conversation script using the gathered interview material.
  • Figure 3: Voice-Role Matching. Left: Construction of a voice library through characteristic analysis of diverse speech segments. Right: The Matching-agent performs voice-role pairing using the voice library, guest profiles, and interview structure.
  • Figure 4: From Conversation Script to Podcast. Audio Script Generation: The Writer-agent creates the audio script by enriching the conversation script with sound effects and music. Instruction-following TTS: Speaking styles are generated along with the conversation script and can be used as instructions to guide expressive speech synthesis. Audio Production: The generated audio segments are combined to create the final podcast.
  • Figure 5: Prompt for GPT-4 Evaluator.
  • ...and 2 more figures
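
The Host-Guest-Writer workflow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the `Guest` dataclass, all function names, and the prompt wording are hypothetical stand-ins, and the real system in the linked repository may organize these steps quite differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class Guest:
    name: str
    role: str  # expertise assigned by the Host-agent (assumed representation)


def host_plan(llm: LLM, topic: str, num_guests: int) -> Tuple[List[Guest], List[str]]:
    """Host-agent: create guest profiles and an interview outline for the topic."""
    guests = [
        Guest(
            name=f"Guest {i + 1}",
            role=llm(f"Propose an expert role for guest {i + 1} on: {topic}").strip(),
        )
        for i in range(num_guests)
    ]
    raw = llm(f"List interview questions, one per line, on: {topic}")
    outline = [line.strip() for line in raw.splitlines() if line.strip()]
    return guests, outline


def guest_answers(llm: LLM, guest: Guest, outline: List[str]) -> List[str]:
    """Guest-agent: answer every outline question from the guest's assigned role."""
    return [llm(f"As {guest.role}, answer: {question}") for question in outline]


def writer_compile(
    llm: LLM,
    topic: str,
    guests: List[Guest],
    outline: List[str],
    answers: List[List[str]],
) -> str:
    """Writer-agent: merge the gathered interview material into one script."""
    notes = []
    for q_idx, question in enumerate(outline):
        notes.append(f"Host: {question}")
        for guest, guest_ans in zip(guests, answers):
            notes.append(f"{guest.name} ({guest.role}): {guest_ans[q_idx]}")
    material = "\n".join(notes)
    return llm(f"Compile a coherent conversation script on '{topic}' from:\n{material}")


def generate_script(llm: LLM, topic: str, num_guests: int) -> str:
    """Run the three stages in order: Host, then Guests, then Writer."""
    guests, outline = host_plan(llm, topic, num_guests)
    answers = [guest_answers(llm, guest, outline) for guest in guests]
    return writer_compile(llm, topic, guests, outline, answers)


if __name__ == "__main__":
    # Offline stub so the sketch runs without any model access.
    def stub_llm(prompt: str) -> str:
        if prompt.startswith("List interview questions"):
            return "What changed recently?\nWhat comes next?"
        return f"<reply to: {prompt.splitlines()[0][:50]}>"

    print(generate_script(stub_llm, "AI in education", num_guests=2))
```

Passing the model in as a plain callable keeps the three roles separable and lets the pipeline be exercised with a stub; swapping in a real model client would only require a function with the same prompt-in, text-out shape.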