AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu

Abstract

Although image generation has boosted various applications through its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or evaluating an illustration with a VLM is a naive approach that requires near-oracle multimodal understanding, which is unreliable for long, complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of each paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluation of visual-logical consistency while relying less on the capability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general benchmarks, reflecting differences in their complex reasoning and high-density generation abilities. Further, logical correctness and aesthetics are hard to optimize simultaneously to the standard of handcrafted illustrations. Additional experiments show that test-time scaling along both abilities significantly boosts performance on this task.
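
To make the protocol concrete, here is a minimal sketch of the per-level VQA scoring loop. It is not the authors' released code: the `QAPair` structure, the `ask_vlm` stub, and the short yes/no answer format are assumptions; wire `ask_vlm` to whichever judge VLM you use.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    level: int     # 1=Component, 2=Topology, 3=Phase, 4=Semantics
    question: str  # e.g. "Does the diagram contain an encoder block?"
    answer: str    # annotated ground-truth short answer, e.g. "yes"

def ask_vlm(image_path: str, question: str) -> str:
    """Query the judge VLM with one closed-form question about the image.

    Placeholder (assumption): connect this to your VLM API of choice;
    it should return a short answer such as "yes" or "no".
    """
    raise NotImplementedError

def logic_accuracy(image_path: str, qa_pairs: list[QAPair]) -> dict[int, float]:
    """Per-level accuracy of a generated illustration against the QA pairs."""
    correct = {1: 0, 2: 0, 3: 0, 4: 0}
    total = {1: 0, 2: 0, 3: 0, 4: 0}
    for qa in qa_pairs:
        pred = ask_vlm(image_path, qa.question).strip().lower()
        total[qa.level] += 1
        correct[qa.level] += int(pred == qa.answer.strip().lower())
    # Report accuracy only for levels that actually have questions.
    return {lvl: correct[lvl] / total[lvl] for lvl in total if total[lvl] > 0}
```

Because each question is closed-form and tied to a single node or edge of the logic diagram, the judge VLM only has to verify a local fact, which is what lets the benchmark rely less on its global understanding ability.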

Paper Structure

This paper contains 25 sections, 1 equation, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: Overview of AIBench. We introduce a comprehensive benchmark to evaluate academic illustration generation through two primary dimensions: question-answering-based logical evaluation and model-based aesthetic assessment. To systematically assess logical accuracy, we collect top-tier conference papers, construct text-to-logic directed graphs, and manually annotate QA pairs across four hierarchical levels. Ultimately, our AIBench establishes a new evaluation standard for generated academic illustrations.
  • Figure 2: QA data construction pipeline of AIBench. We construct the multi-level QA pairs with the help of Gemini, followed by careful human annotation.
  • Figure 3: Statistical analysis of AIBench. (a) Papers are curated from four representative 2025 conferences (CVPR, ICCV, NeurIPS, ICLR). (b) Models are evaluated on four hierarchical QA levels: Component, Topology, Phase, and Semantics. (c) A word cloud shows the lexical frequency and topic diversity of AIBench. (d) Top research topics in the 2025 papers, covering diffusion, LLMs, 3D reconstruction, etc.
  • Figure 4: The evaluation pipeline of AIBench. Models generate academic illustrations based on method descriptions, and are subsequently evaluated on two primary dimensions: logical accuracy, assessed via paper-specific QA pairs across four hierarchical levels (L1–L4, ranging from component existence to global semantics), and visual appeal, measured by a model-based aesthetic score. A hypothetical aggregation sketch of these two dimensions follows the figure list.
  • Figure 5: Qualitative examples of typical generation failure modes leading to incorrect answers: (a) missing components, (b) layout errors, (c) hallucinated reasoning/incorrect logic, and (d) unclear text rendering.
  • ...and 14 more figures
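
As a companion to Figure 4, the sketch below shows one hypothetical way to fold the four per-level logic accuracies and the model-based aesthetic score into a single report. The equal weighting across levels and the 0.5 logic/aesthetic split are illustrative assumptions; the benchmark itself treats logic and aesthetics as separate axes.

```python
def aibench_report(level_acc: dict[int, float], aesthetic: float,
                   logic_weight: float = 0.5) -> dict[str, float]:
    """Combine L1-L4 logic accuracies with an aesthetic score in [0, 1].

    Assumption: levels are averaged with equal weight, and the overall
    score is a convex combination of the two dimensions.
    """
    logic = sum(level_acc.values()) / len(level_acc)
    return {
        "logic": logic,
        "aesthetic": aesthetic,
        "overall": logic_weight * logic + (1 - logic_weight) * aesthetic,
    }

# Example: strong on component existence (L1) but weak on global semantics (L4).
print(aibench_report({1: 0.9, 2: 0.7, 3: 0.6, 4: 0.4}, aesthetic=0.8))
```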