
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen, Daniel Khashabi

Abstract

In many science papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, and human authors often need significant effort and iteration to get them right, highlighting the difficulty of visual communication in science. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models on their ability to produce figures that clearly express and motivate the central idea of a paper, given its title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails text-to-image reasoning that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and remains faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and show that the task remains a significant challenge, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.
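As an illustration of the task interface, a minimal sketch of one GENFIG1-style instance is given below; the field names and the `model.generate_image` call are assumptions for exposition, not the benchmark's actual data format or API.

```python
from dataclasses import dataclass


@dataclass
class GenFig1Instance:
    """One task instance: the textual inputs a model receives (assumed field names)."""
    title: str
    abstract: str
    introduction: str
    figure_caption: str

    def to_prompt(self) -> str:
        # Assemble a single text prompt asking the model to design the figure.
        return (
            "Design a 'Figure 1' that visually summarizes and motivates the core "
            "idea of the following paper.\n\n"
            f"Title: {self.title}\n\n"
            f"Abstract: {self.abstract}\n\n"
            f"Introduction: {self.introduction}\n\n"
            f"Target figure caption: {self.figure_caption}\n"
        )


def generate_figure(model, instance: GenFig1Instance) -> bytes:
    """Query a text-to-image-capable model and return the produced figure as image bytes.

    `model.generate_image` is a hypothetical interface; any text-to-image or
    vision-language model with image output could be substituted here.
    """
    return model.generate_image(prompt=instance.to_prompt())
```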


Paper Structure

This paper contains 45 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Examples from GenFig1 (first row from liu2024aligning, second row from wu2022sentence). The task is to produce figures that clearly express and motivate the central idea of a paper, given its title, abstract, introduction, and figure caption as input. Example responses from models we evaluate are shown in the middle column. Solving GenFig1 requires more than visually appealing graphics: the task entails cross-modal reasoning that couples scientific understanding with visual synthesis.
  • Figure 2: Resulting taxonomy of Figure 1s. We define three top-level taxonomies (Overview, Example, and Experimental Results) and multiple sub-taxonomies, with Overview and Example accounting for the majority. Among them, Example–Background and Example–Method are the most frequent, followed by Overview–Model Architecture and Overview–Method.
  • Figure 3: Figure 1 examples produced by humans and by models for all baselines, for the paper Liu2023AligningLL. The first row shows Human, Zero-shot, and CoT; the second row shows Zero-shot SVG, CoT SVG, and Chain-of-images. The Zero-shot, CoT, and Chain-of-images outputs are relatively decent, although cropped.
  • Figure 4: Examples for the taxonomy of Figure 1s.
  • Figure 5: UMAP visualizations of the paper representations: (a) clustered by venue and (b) clustered by research field. Papers from different venues show some clustering tendencies, but there is considerable overlap, indicating shared or transferable representations across domains.
  • ...and 2 more figures