PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

Rohit Saxena, Pasquale Minervini, Frank Keller

TL;DR

PosterSum introduces a large-scale dataset of 16,305 poster–abstract pairs to benchmark multimodal poster summarization and reveals that state-of-the-art MLLMs underperform on visually complex posters. The authors propose Segment & Summarize, a hierarchical, training-free approach that segments posters into regions, generates localized region summaries, and merges them into a global abstract, achieving a new state-of-the-art ROUGE-L of 24.18. Across zero-shot and fine-tuned settings, Segment & Summarize outperforms both OCR-based baselines and various MLLMs, demonstrating the value of region-level reasoning for dense visual documents. The work provides a valuable resource for advancing multimodal understanding of scientific posters and highlights ongoing challenges in evaluation and bias when summarizing information-dense scientific content.

Abstract

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.
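The hierarchical idea behind Segment & Summarize (split the poster into regions, summarize each region, then merge the local summaries into one abstract) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform grid stands in for whatever segmentation the paper actually uses, and `summarize_region` and `merge_summaries` are hypothetical callables that a reader would back with an MLLM and a text-only LLM respectively.

```python
from typing import Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels; a hypothetical region format.
Box = Tuple[int, int, int, int]


def grid_segments(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height poster into rows x cols boxes.

    Assumption: a uniform grid is only a stand-in for a real segmenter.
    """
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def segment_and_summarize(
    boxes: Sequence[Box],
    summarize_region: Callable[[Box], str],        # hypothetical MLLM call
    merge_summaries: Callable[[List[str]], str],   # hypothetical text-LLM call
) -> str:
    """Summarize each region locally, then merge into one global summary."""
    local = [summarize_region(box) for box in boxes]
    # Drop regions that produced no text (e.g. blank margins).
    local = [s for s in local if s.strip()]
    return merge_summaries(local)
```

With stub callables, `segment_and_summarize(grid_segments(100, 60, 2, 2), describe, join)` calls `describe` once per box in row-major order and passes the non-empty results to `join`. Keeping the two model calls as injected functions makes the control flow testable without any model.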

Paper Structure

This paper contains 42 sections, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: An example of a scientific poster from the PosterSum dataset. The poster, describing the work in GuptaJLA24, contains visual elements such as structured tables with numerical results, charts, diagrams, and textual sections, demonstrating the multimodal complexity present in the dataset.
  • Figure 2: Distribution of the PosterSum dataset.
  • Figure 3: Distribution of top 25 topics for the posters in the dataset.
  • Figure 4: Illustration of our Segment & Summarize pipeline. The poster, describing the work in rakitin2024regularized, is first divided into segments, each of which is summarized by an MLLM. These localized summaries are subsequently merged by a text-based large language model to generate a single, coherent summary.
  • Figure 5: Effect of text present in the poster on summarization. We report mean ROUGE-L scores for different OCR-extracted character-length bins. The red dashed line represents the number of posters in each bin.
  • ...and 1 more figure
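
The Figure 5 analysis (mean ROUGE-L per OCR character-length bin) can be approximated as follows. This is a simplified sketch, not the evaluation code used in the paper: it computes ROUGE-L as an F1 score over the longest common subsequence of lowercased, whitespace-split tokens, with no stemming or sentence splitting, so the numbers will not match a standard ROUGE package exactly. The function names and the `bin_width` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 on whitespace tokens (assumption: no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score_by_bin(
    char_counts: Sequence[int], scores: Sequence[float], bin_width: int
) -> Dict[int, float]:
    """Average scores per character-length bin, keyed by the bin's lower edge."""
    buckets = defaultdict(list)
    for n_chars, score in zip(char_counts, scores):
        buckets[(n_chars // bin_width) * bin_width].append(score)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}
```

Here `char_counts` would hold the OCR-extracted character count of each poster and `scores` the matching per-poster ROUGE-L values; the length of each bucket gives the number of posters per bin that the figure overlays as a dashed line.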