PaperX: A Unified Framework for Multimodal Academic Presentation Generation with Scholar DAG

Tao Yu, Minghui Zhang, Zhiqing Cui, Hao Wang, Zhongtian Luo, Shenghua Chai, Junhao Gong, Yuzhao Peng, Yuxuan Zhou, Yujia Yang, Zhenghao Zhang, Haopeng Jin, Xinming Wang, Yufei Xiong, Jiabing Yang, Jiahao Yuan, Hanqing Wang, Hongzhu Yi, YiFan Zhang, Yan Huang, Liang Wang

TL;DR

PaperX tackles the fragmentation of automated scientific presentation generation by introducing Scholar DAG, a structured intermediate representation that separates semantic content from modality-specific rendering. The two-stage pipeline (Paper2DAG and DAG2Scholar) converts a paper into a hierarchical, cross-modal graph and then renders PPTs, posters, and PR content via modality-specific graph traversal and refinement. The framework achieves state-of-the-art content fidelity, aesthetics, and efficiency across PPTEval, Paper2Poster, and PRBench, and demonstrates extensibility through integration with external rendering systems. This approach promises scalable, coherent, and cost-effective dissemination of scientific knowledge across diverse formats.

Abstract

Transforming scientific papers into multimodal presentation content is essential for research dissemination but remains labor-intensive. Existing automated solutions typically treat each format as an isolated downstream task, leading to redundant processing and semantic inconsistency. We introduce PaperX, a unified framework that models academic presentation generation as a structural transformation and rendering process. Central to our approach is the Scholar DAG, an intermediate representation that decouples the paper's logical structure from its final presentation syntax. By applying adaptive graph traversal strategies, PaperX generates diverse, high-quality outputs from a single source. Comprehensive evaluations demonstrate that our framework achieves state-of-the-art performance in content fidelity and aesthetic quality while significantly improving cost efficiency compared to specialized single-task agents.
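
The idea of rendering several formats from one intermediate graph can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Node` schema, the function names, and the two traversal rules (depth-first for slides, a depth-limited breadth-first pass for poster panels) are hypothetical and are not taken from the PaperX implementation, whose actual traversal strategies are not described in this summary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One unit of paper content (hypothetical schema, not the paper's)."""
    node_id: str
    kind: str       # e.g. "section" or "figure"
    summary: str
    children: List["Node"] = field(default_factory=list)


def depth_first(root):
    """Yield each node once in reading order: parent, then children left to right."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node.node_id in seen:   # a DAG may reach the same node twice
            continue
        seen.add(node.node_id)
        yield node
        stack.extend(reversed(node.children))


def breadth_first(root, max_depth):
    """Yield (depth, node) level by level, never descending past max_depth."""
    seen, queue = set(), deque([(0, root)])
    while queue:
        depth, node = queue.popleft()
        if node.node_id in seen:
            continue
        seen.add(node.node_id)
        yield depth, node
        if depth < max_depth:
            queue.extend((depth + 1, child) for child in node.children)


def render_slides(root):
    """Assumed slide rule: one slide title per section node, in reading order."""
    return [n.summary for n in depth_first(root) if n.kind == "section"]


def render_poster(root):
    """Assumed poster rule: one panel per top-level section under the root."""
    return [n.summary for d, n in breadth_first(root, max_depth=1) if d == 1]


# Toy graph standing in for the output of a paper-to-graph stage.
paper = Node("root", "section", "PaperX", [
    Node("s1", "section", "Method", [
        Node("s1.1", "section", "Paper2DAG"),
        Node("s1.2", "section", "DAG2Scholar"),
        Node("f2", "figure", "Pipeline overview"),
    ]),
    Node("s2", "section", "Experiments"),
])

slides = render_slides(paper)
poster = render_poster(paper)
```

Both renderers read the same graph, so the paper is parsed once and each output format differs only in how it walks the structure. That is the property the abstract attributes to the Scholar DAG, shown here in a deliberately simplified form.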

Paper Structure (25 sections, 9 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Compared with existing methods, PaperX is able to incorporate and present substantially richer academic content.
  • Figure 2: Scholar DAG construction and DAG-driven multimodal academic presentation generation.
  • Figure 3: Case 1. Panels (a) and (c) show the PPT slides before refinement, while panels (b) and (d) present the corresponding slides after refinement. As observed, slides in (a) and (c) suffer from issues such as element overlap, content overflow, uneven spatial distribution, and textual redundancy, whereas these issues are effectively resolved in (b) and (d).
  • Figure 4: Case 2. Panels (a) and (c) present the posters before refinement, while panels (b) and (d) show the corresponding posters after refinement. Compared to (a) and (c), which suffer from inefficient space utilization, the refined results in (b) and (d) effectively improve the overall layout efficiency.
  • Figure 5: Case 3. Panels (a) and (b) illustrate the PR content before refinement, whereas panels (c) and (d) depict the refined PR content. The pre-refinement examples show a mismatch with prevailing platform-specific writing styles, which is effectively addressed after refinement.
  • ...and 4 more figures