Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023

Ting-Yao E. Hsu, Yi-Li Hsu, Shaurya Rohatgi, Chieh-Yang Huang, Ho Yin Sam Ng, Ryan Rossi, Sungchul Kim, Tong Yu, Lun-Wei Ku, C. Lee Giles, Ting-Hao K. Huang

TL;DR

This work assesses whether modern large multimodal models can solve the task of captioning scientific figures by evaluating the first SciCap Challenge (2023) with an expanded, multi-domain dataset. Through automatic metrics and extensive human evaluations by expert editors, the study finds that GPT-4V captions are overwhelmingly preferred over both author-written captions and captions from other models, suggesting substantial progress but not a complete solution. Analyses explore why editors favor GPT-4V, how well the results generalize to unseen arXiv figures, and how expert editors and lay readers differ in their perception of captions, while highlighting ongoing challenges in evaluation, detail, and personalization. The results imply significant practical potential for AI-assisted captioning in scholarly communication, alongside a call for robust, domain-aware evaluation methods and models capable of user-specific caption customization.

Abstract

Since the SciCap dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?

Paper Structure

This paper contains 50 sections, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: In the SciCap Challenge, models generate captions based on the figure and the figure-mentioning paragraph. [Paper source: ngunderstanding]
  • Figure 2: ROUGE-2 normalized scores of each model across eight arXiv domains, highlighting similar trends and demonstrating the generalizability of the caption generation approaches.
  • Figure 3: ROUGE-2 scores by model across five figure types, showing similar trends.
  • Figure 4: Rankings of generated captions by all models in Study 2 across three evaluation conditions (A, B, C) and three experts. Models were ranked from 1 (best, green) to 6 (worst, red). GPT-4V (Image+Paragraph) consistently outperformed other models, including humans, across varying length constraints: none (A), a 25-word limit (B), and a strict limit matching human-written caption length (C).
  • Figure 5: The drag-and-drop interface used by professional editors to rank captions for a figure.
  • ...and 5 more figures