PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
Yujia Xiao, Liumeng Xue, Lei He, Xinyi Chen, Aemon Yat Fei Chiu, Wenjie Tian, Shaofei Zhang, Qiuqiang Kong, Xinfa Zhu, Wei Xue, Tan Lee
TL;DR
PodEval tackles the challenge of evaluating open-ended, long-form podcast-like audio by introducing a multimodal framework that decomposes evaluation into three modalities (text, speech, and audio), each with content- and format-focused metrics. It provides the Real-Pod dataset as a broad, real-world reference and combines objective metrics with structured subjective tests (including a MUSHRA-inspired dialogue naturalness assessment and a justification-enabled questionnaire MOS) to capture both technical quality and user-perceived experience. The framework is demonstrated on diverse podcast-generation systems (open-source, closed-source, and human-made), uncovering strengths and gaps in dialogue scripting, voice consistency, and audio harmony, and is released openly to promote reproducibility and community-driven improvements. Overall, PodEval offers an extensible way to evaluate open-ended long-form audio generation and to guide future work on multimodal podcast synthesis.
Abstract
Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models' understanding capabilities. However, the assessment of generative capabilities remains underexplored, especially for open-ended long-form content generation. The main challenges are the absence of a reference answer, the lack of unified evaluation metrics, and the difficulty of controlling human judgments. In this work, we take podcast-like audio generation as a starting point and propose PodEval, a comprehensive open-source evaluation framework. In this framework: 1) We construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. 2) We introduce a multimodal evaluation strategy that decomposes the complex task into three dimensions (text, speech, and audio), with different evaluation emphasis on "Content" and "Format". 3) For each modality, we design corresponding evaluation methods, involving both objective metrics and subjective listening tests. Our experiments cover representative podcast generation systems (open-source, closed-source, and human-made). The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open-ended long-form audio. This project is open-source to facilitate public use: https://github.com/yujxx/PodEval.
