Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr

TL;DR

The paper tackles the conversion of long, multimodal scientific papers into single-page posters, contributing both a benchmark and a multi-agent system. The Paper2Poster benchmark evaluates posters along four dimensions, including PaperQuiz, which measures how much of a paper's core content a poster conveys. PosterAgent uses a top-down Parser, Planner, and Painter-Commenter pipeline to produce editable PPTX posters. Its open-source variants outperform existing GPT-4o-driven multi-agent systems on nearly all metrics while using 87% fewer tokens, and the evaluation identifies reader engagement as the primary aesthetic bottleneck.

Abstract

Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
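To make the Planner's "binary-tree layout" more concrete, the following is a minimal sketch of one way such a layout could work, not the authors' implementation. It assumes each panel carries a positive `weight` (a hypothetical estimate of its content size), recursively splits the panel list at the point that best balances the two halves, and divides the rectangle along its longer side in proportion to those weights. The names `Panel`, `Box`, and `layout` are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Panel:
    name: str
    weight: float  # hypothetical content-size estimate; must be > 0


def layout(panels: List[Panel], box: Box) -> List[Tuple[str, Box]]:
    """Recursively split `box` among `panels`, keeping their given order."""
    if len(panels) == 1:
        return [(panels[0].name, box)]

    total = sum(p.weight for p in panels)

    # Pick the split index whose two halves have the most similar weight.
    best_k, best_gap, running = 1, float("inf"), 0.0
    for k in range(1, len(panels)):
        running += panels[k - 1].weight
        gap = abs(total - 2 * running)
        if gap < best_gap:
            best_k, best_gap = k, gap

    first_half, second_half = panels[:best_k], panels[best_k:]
    frac = sum(p.weight for p in first_half) / total

    # Cut along the longer side so panels stay reasonably proportioned.
    x, y, w, h = box
    if w >= h:
        first_box = (x, y, w * frac, h)
        second_box = (x + w * frac, y, w * (1 - frac), h)
    else:
        first_box = (x, y, w, h * frac)
        second_box = (x, y + h * frac, w, h * (1 - frac))

    return layout(first_half, first_box) + layout(second_half, second_box)


if __name__ == "__main__":
    sections = [Panel("Intro", 1), Panel("Method", 1), Panel("Results", 2)]
    for name, rect in layout(sections, (0.0, 0.0, 4.0, 2.0)):
        print(name, rect)
```

Because every split keeps the list order and assigns the earlier half to the left or top box, the leaves come out in reading order, and each panel's area is proportional to its weight.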

Paper Structure

This paper contains 48 sections, 10 equations, 29 figures, 11 tables.

Figures (29)

  • Figure 1: Overview of this work. We address two core challenges in scientific poster generation. Left: how to create a poster from a paper; we propose PosterAgent (Sec. \ref{sec:posteragent}), a framework that transforms long-context scientific papers (20K+ tokens) into structured visual posters. Right: how to evaluate poster quality; we introduce the Paper2Poster benchmark (Sec. \ref{sec:paper2poster}), which enables systematic comparison between agent-generated and author-designed posters.
  • Figure 2: Data Statistics of Paper2Poster. (a) Word cloud illustrating the diversity of research topics. (b) Textual Token statistics and Figure count statistics for input papers vs. posters provided by authors. Overall, these statistics highlight that Paper2Poster is a multimodal context compression task, requiring effective abstraction of both textual and visual content.
  • Figure 3: Left: Overview of the evaluation framework in Paper2Poster. Middle: We automatically generate multiple-choice questions from each paper using an LLM (o3), forming our PaperQuiz evaluation. Right: In PaperQuiz, we simulate multiple readers by having VLMs that represent different expertise levels (e.g., student, professor) read each generated poster and answer the quiz. The poster that achieves the highest average score is considered the most effective in conveying the paper's content.
  • Figure 4: Illustration of the PosterAgent pipeline. Given an input paper, PosterAgent generates a structured academic poster through three modules: 1. Parser: Extracts key textual and visual assets using a combination of tools and LLM-based summarization, resulting in a structured asset library. 2. Planner: Matches assets and arranges them into coherent layouts, iteratively generating panels with a zoom-in operation. 3. Painter–Commenter: The Painter generates panel-level bullet-content along with executable code, and renders the visual output, while the Commenter—a VLM with in-context reference—provides feedback to ensure layout coherence and prevent content overflow.
  • Figure 5: Average PaperQuiz scores across different Reader VLMs (x-axis) for each poster type (legend lines). Refer to Appendix Tab. \ref{tab:abbrev_fullname} for full model names.
  • ...and 24 more figures
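
The PaperQuiz procedure described in the Figure 3 caption (several simulated readers answer the same multiple-choice quiz, and the poster with the highest average score wins) can be sketched as below. This is a simplified illustration under stated assumptions, not the benchmark's scoring code: it uses plain accuracy per reader, takes the readers' answers as given rather than calling any VLM, and the function names `reader_accuracy`, `paperquiz_score`, and `best_poster` are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def reader_accuracy(answers: List[str], answer_key: List[str]) -> float:
    """Fraction of quiz questions a single reader answered correctly."""
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must have the same length")
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / len(answer_key)


def paperquiz_score(
    answers_by_reader: Dict[str, List[str]], answer_key: List[str]
) -> float:
    """Average accuracy over all simulated readers for one poster."""
    return mean(
        reader_accuracy(answers, answer_key)
        for answers in answers_by_reader.values()
    )


def best_poster(
    answers_by_poster: Dict[str, Dict[str, List[str]]], answer_key: List[str]
) -> str:
    """Name of the poster with the highest average reader score."""
    scores = {
        poster: paperquiz_score(readers, answer_key)
        for poster, readers in answers_by_poster.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    answer_key = ["A", "B", "C", "D"]
    # Made-up answers for illustration only; real answers would come from VLMs.
    answers_by_poster = {
        "poster_1": {
            "student": ["A", "B", "C", "A"],
            "professor": ["A", "B", "C", "D"],
        },
        "poster_2": {
            "student": ["A", "A", "A", "A"],
            "professor": ["A", "B", "A", "A"],
        },
    }
    print(best_poster(answers_by_poster, answer_key))
```

Averaging per reader first, then across readers, gives each simulated expertise level equal influence on a poster's score.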