MedICaT: A Dataset of Medical Images, Captions, and Textual References

Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi

TL;DR

MedICaT addresses the gap in context-rich medical figure understanding by compiling 217K figures from 131K open-access papers with captions, inline references for ~74% of figures, and subfigure/subcaption annotations for 2,069 compound figures (7,507 subcaptions). It introduces the subfigure-subcaption alignment task and provides strong baselines based on transformer-CRF hybrids, achieving $F1\approx0.674$ and subfigure detection $mAP\approx79.3$ on held-out data, while showing that inline references improve image-text matching. The dataset links with ROCO and S2ORC to enable broader figure-text retrieval and supports multi-modal medical understanding and pre-training opportunities. Overall, MedICaT enables studying figures in their scientific context and facilitates downstream tasks such as captioning, VQA, and knowledge-graph construction in medicine.

Abstract

Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.

Paper Structure

This paper contains 28 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Example of color-coded subfigures and corresponding subcaptions (top), with an example inline reference from the full text. Figure and text adapted from Dhungana2018RareCL.
  • Figure 2: Subfigure to subcaption alignment is challenging for this figure because the subfigures are referenced by spatial position in right-to-left order. Corresponding subfigures and subcaptions are indicated by color. Figure and caption adapted from Dhungana2018RareCL's companion example, Ohkura2015PrimaryAL.
  • Figure 3: Extracted figures from anonymized are aligned with the S2ORC parse of the paper PDF to link caption text (red) and inline references (blue) to each image. Example figure from Dhungana2018RareCL.