Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots
Maciej P. Polak, Dane Morgan
TL;DR
This work addresses the manual bottleneck of extracting quantitative data from plots by introducing PlotExtract, a zero-shot workflow that leverages vision-capable LLMs to digitize data from two-axis plots. The method sequentially extracts data, generates and runs replotting code, and verifies accuracy via a visual comparison, without any model fine-tuning. Across synthetic and published datasets, it achieves average data-extraction errors in the low single-digit percent range, with precision effectively at or near $100\%$ and recall around $85$–$90\%$, demonstrating high-throughput viability. The approach promises automated, scalable data extraction from plots, with expected improvements as multimodal LLMs continue to advance.
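The three stages summarized above (extract, replot, verify) can be sketched as a single pass. This is a minimal illustration, not the authors' implementation: the function name `plot_extract_sketch` and the three callables `extract_points`, `replot`, and `looks_same` are hypothetical stand-ins for the multimodal LLM calls and the execution of the generated plotting code, none of which are specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass
class ExtractionResult:
    points: List[Point]   # data points read off the source plot
    accepted: bool        # True if the replot passed the visual check


def plot_extract_sketch(
    image: bytes,
    extract_points: Callable[[bytes], List[Point]],
    replot: Callable[[List[Point]], bytes],
    looks_same: Callable[[bytes, bytes], bool],
) -> ExtractionResult:
    """One pass of an extract -> replot -> verify workflow.

    All three callables are assumptions made for illustration:
      extract_points: asks a vision-capable model for the plotted (x, y) values
      replot:         renders the extracted points back into an image
      looks_same:     asks the model whether the two images show the same data
    """
    # Stage 1: read the data points from the source image.
    points = extract_points(image)
    # Stage 2: render the extracted points as a new plot.
    rendered = replot(points)
    # Stage 3: compare the original and the replot; keep the verdict.
    accepted = looks_same(image, rendered)
    return ExtractionResult(points=points, accepted=accepted)
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in practice each one would wrap a zero-shot prompt to the model.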
Abstract
Automated data extraction from research texts has been steadily improving, and the emergence of large language models (LLMs) has accelerated that progress. Extracting data from plots in research papers, however, is complex enough that it is still done predominantly by hand. We show that current multimodal large language models, given proper instructions and engineered workflows, can accurately extract data from plots. This capability is inherent to the pretrained models and can be accessed with a chain-of-thought sequence of zero-shot engineered prompts we call PlotExtract, without any fine-tuning. We demonstrate PlotExtract here and assess its performance on synthetic and published plots, considering only plots with two axes. For plots identified as extractable, PlotExtract finds points with over 90% precision (and around 90% recall) and errors in x and y position of around 5% or lower. These results show that multimodal LLMs are a viable path to high-throughput data extraction from plots and in many circumstances can replace the current manual methods.
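To make the reported quantities concrete, the following sketch scores a set of extracted points against known ground truth. It rests on assumptions not stated in the abstract: points are matched greedily to the nearest unmatched true point, a match counts only if both coordinates fall within a tolerance `tol` expressed as a fraction of the axis range, and position errors are likewise normalized by axis range. The function name `score_extraction` and all of its parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def score_extraction(
    truth: List[Point],
    extracted: List[Point],
    x_range: float,
    y_range: float,
    tol: float = 0.05,
) -> Dict[str, float]:
    """Precision, recall, and mean range-normalized x/y errors.

    Assumed matching rule (illustrative only): each extracted point is
    paired with the closest still-unmatched true point, provided both
    normalized coordinate errors are within `tol`.
    """
    unmatched = list(truth)
    x_errs: List[float] = []
    y_errs: List[float] = []
    for ex, ey in extracted:
        best_index = None
        best_dist = None
        for i, (tx, ty) in enumerate(unmatched):
            dx = abs(ex - tx) / x_range
            dy = abs(ey - ty) / y_range
            dist = max(dx, dy)
            if dist <= tol and (best_dist is None or dist < best_dist):
                best_index, best_dist = i, dist
        if best_index is not None:
            # Consume the matched true point so it cannot be matched twice.
            tx, ty = unmatched.pop(best_index)
            x_errs.append(abs(ex - tx) / x_range)
            y_errs.append(abs(ey - ty) / y_range)
    matched = len(x_errs)
    return {
        # Fraction of extracted points that correspond to a true point.
        "precision": matched / len(extracted) if extracted else 0.0,
        # Fraction of true points that were recovered.
        "recall": matched / len(truth) if truth else 0.0,
        "mean_x_error": sum(x_errs) / matched if matched else 0.0,
        "mean_y_error": sum(y_errs) / matched if matched else 0.0,
    }
```

Under this convention, a spurious extracted point lowers precision, a missed true point lowers recall, and the position errors are averaged over matched pairs only.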
