PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

Mayank Ravishankara

TL;DR

PlotChain is a deterministic, generator-based benchmark for quantitative engineering plot reading, with exact ground truth derived from the generation parameters. It introduces checkpoint-based diagnostics (cp_ fields) to localize failures, spans 15 plot families with 450 items, and enforces strict JSON numeric outputs under a fixed decoding policy for reproducible evaluation. The study shows that frontier multimodal models reach strong field-level accuracy (about 78–80%) but struggle with derived-quantity tasks (e.g., bandpass response, FFT spectrum) and with multi-field end-to-end completion, where every final field of an item must pass. The generator, data, raw outputs, and evaluation tools are released with checksums, so results can be re-scored under alternative tolerance policies, and the benchmark can be extended to broader plot styles and real-world noise sources.

Abstract

We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading, i.e., recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate 'cp_' fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0 and a strict JSON-only numeric output schema) and score predictions using per-field tolerances designed to reflect human plot-reading precision. Under the 'plotread' tolerance policy, the top models achieve 80.42% (Gemini 2.5 Pro), 79.84% (GPT-4.1), and 78.21% (Claude Sonnet 4.5) overall field-level pass rates, while GPT-4o trails at 61.59%. Despite strong performance on many families, frequency-domain tasks remain brittle: the bandpass response pass rate stays low (≤ 23%), and the FFT spectrum family remains challenging. We release the generator, dataset, raw model outputs, scoring code, and manifests with checksums to support fully reproducible runs and retrospective rescoring under alternative tolerance policies.
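The scoring protocol described in the abstract (strict JSON-only numeric output, per-field tolerances, cp_ checkpoint fields alongside final fields) can be sketched roughly as follows. This is a minimal illustration and not the released scoring code: the function names, the example field names `cp_peak_mag_db` and `cutoff_hz`, the tolerance values, and the rule that a field passes when its error is within the larger of an absolute and a relative bound are all assumptions made for the example.

```python
import json
import math


def parse_strict_json(raw):
    """Return a dict of numeric fields, or None if the reply is not
    a JSON object whose values are all numbers (assumed strictness rule)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for value in obj.values():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    return obj


def field_passes(pred, truth, abs_tol, rel_tol):
    """Hypothetical tolerance rule: pass if |pred - truth| is within
    max(abs_tol, rel_tol * |truth|). Missing or non-finite values fail."""
    if pred is None or not math.isfinite(pred):
        return False
    return abs(pred - truth) <= max(abs_tol, rel_tol * abs(truth))


def score_item(raw, truth, tolerances):
    """Score one model reply; returns field name -> pass/fail.
    A reply that violates the schema fails every field."""
    pred = parse_strict_json(raw) or {}
    return {
        name: field_passes(pred.get(name), truth[name], *tolerances[name])
        for name in truth
    }


# Illustrative ground truth and (abs_tol, rel_tol) pairs, not the paper's values.
truth = {"cp_peak_mag_db": 6.0, "cutoff_hz": 1000.0}
tolerances = {"cp_peak_mag_db": (0.5, 0.0), "cutoff_hz": (0.0, 0.05)}
reply = '{"cp_peak_mag_db": 6.3, "cutoff_hz": 1100}'
result = score_item(reply, truth, tolerances)
```

Because the raw reply is scored as text against stored ground truth, the same replies can be re-scored later under a different tolerance table without re-querying any model, which is the property the released raw outputs are meant to support.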

Paper Structure (38 sections, 1 equation, 4 figures, 4 tables)

Introduction
Contributions.
Related Work
Chart and Plot Question Answering Benchmarks
Derendering and Plot-to-Table Conversion
Chart Summarization and Accessibility
Chart-Focused Evaluation in the LVLM Era
Benchmarking Methodology and Reproducibility
Benchmark Design and Dataset
Task Definition
Plot Families
Deterministic Generation and Ground Truth
Difficulty and Edge-Case Design
Dataset Format and Released Artifacts
Experimental Setup and Evaluation
...and 23 more sections

Figures (4)

Figure 1: Representative PlotChain samples (one per family). Row 1: Step response; Bode magnitude; Bode phase; Bandpass frequency response; Time-domain waveform. Row 2: FFT magnitude spectrum; Spectrogram; Resistor I–V curve; Diode I–V curve; Transfer characteristic. Row 3: Pole–zero plot; Stress–strain curve; Torque–speed curve; Pump characteristic curve; S–N fatigue curve.
Figure 2: Family-level performance heatmap (final-field pass rate, %). Each cell is the percentage of final (non-cp_*) fields that pass tolerance, aggregated over all items in the family.
Figure 3: Headline ranking by item-level strict all-pass (all final fields must pass).
Figure 4: Final vs. checkpoint field pass rates by model under plotread.
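
The three aggregates named in the captions of Figures 2 to 4 (final-field pass rate, item-level strict all-pass, and final versus checkpoint pass rates) differ only in how per-field pass/fail flags are pooled. The sketch below shows one way to compute them, assuming each item has already been scored into a mapping from field name to a boolean and that checkpoint fields are identified by the `cp_` prefix; the function name `summarize` and the sample data are hypothetical.

```python
def summarize(results):
    """Aggregate per-field pass/fail flags into three percentages.

    results: list of dicts, one per item, mapping field name -> bool.
    Fields whose name starts with "cp_" count as checkpoints; all
    other fields count as final targets.
    """
    final_flags, cp_flags = [], []
    strict_pass, items_with_finals = 0, 0
    for item in results:
        finals = [ok for name, ok in item.items() if not name.startswith("cp_")]
        cps = [ok for name, ok in item.items() if name.startswith("cp_")]
        final_flags.extend(finals)
        cp_flags.extend(cps)
        if finals:
            items_with_finals += 1
            # Strict all-pass: every final field of the item must pass.
            if all(finals):
                strict_pass += 1

    def pct(passed, total):
        return 100.0 * passed / total if total else None

    return {
        "final_field_pass_rate": pct(sum(final_flags), len(final_flags)),
        "checkpoint_field_pass_rate": pct(sum(cp_flags), len(cp_flags)),
        "item_strict_all_pass": pct(strict_pass, items_with_finals),
    }


# Two made-up items: the first passes everything, the second misses one final field.
sample = [
    {"cp_a": True, "x": True, "y": True},
    {"cp_a": False, "x": True, "y": False},
]
summary = summarize(sample)
```

In this toy sample the field-level rate for final fields is 75% while the strict all-pass rate is only 50%, which illustrates why a model can look strong per field yet rank lower on end-to-end completion of multi-field items.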
