MatViX: Multimodal Information Extraction from Visually Rich Articles

Ghazal Khalighinejad, Sharon Scott, Ollie Liu, Kelly L. Anderson, Rickard Stureborg, Aman Tyagi, Bhuwan Dhingra

TL;DR

This work introduces a benchmark for multimodal information extraction (MIE) that pairs full-length research articles with complex structured JSON files curated by domain experts. It also introduces an evaluation method that assesses curve similarity and the alignment of hierarchical structures.

Abstract

Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data is often spread across text, figures, and tables. In materials science, extracting structured information from research articles can accelerate the discovery of new materials. However, the multimodal nature and complex interconnections of scientific content present challenges for traditional text-based methods. We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured JSON files, carefully curated by domain experts. These JSON files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE. We introduce an evaluation method that assesses curve similarity and the alignment of hierarchical structures. Additionally, we benchmark, in a zero-shot manner, vision-language models (VLMs) capable of processing long contexts and multimodal inputs, and show that using a specialized model (DePlot) can improve performance in extracting curves. Our results demonstrate significant room for improvement in current models. Our dataset and evaluation code are available at https://matvix-bench.github.io/.
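
The abstract does not say which curve-similarity measure the benchmark uses. As a rough illustration of what comparing an extracted curve against a reference curve can look like, the sketch below computes the discrete Fréchet distance between two polylines. The choice of metric, the function name `discrete_frechet`, and the sample points are all assumptions made for illustration, not details taken from the paper.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Frechet distance between two curves.

    Each curve is a non-empty list of (x, y) points in plotting order.
    This is an illustrative metric, not necessarily the one used by MatViX.
    """
    if not p or not q:
        raise ValueError("both curves need at least one point")

    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance for prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Hypothetical reference curve and a predicted curve shifted up by 0.5.
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
predicted = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.5)]
print(discrete_frechet(reference, predicted))  # 0.5
```

A distance of zero means the two point sequences coincide; larger values indicate that the extracted curve drifts further from the reference at some point along its length.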

Paper Structure

This paper contains 51 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Example of an article with interconnected data between text and figures, with a JSON structure capturing sample properties and composition details.
  • Figure 2: A figure and its corresponding sample. Note how the data points in the properties come from the plot in the image. The data points in the JSON are shortened to fit on the page; the actual JSON is much larger. Some information in the JSON, such as the full name of the filler PST, is not shown in the figure but can be found in the text. See the original article DANG2008171.
  • Figure 3: A figure and its corresponding sample. Note how the data points in the properties are derived from the plot in the image. There are three types of data points in this plot; while these are not explicitly labeled in the image, the figure title specifies which samples each type corresponds to. See the original article VANGINKEL1992319.
  • Figure 4: Annotation guidelines for identifying PBD sample compositions and properties.
  • Figure 5: Sample prompt to GPT-4o for extracting nanocomposite samples. The provided article has been truncated due to space constraints. The input includes neither the figures parsed by DePlot nor any images; the model receives text only.
  • ...and 4 more figures