Measuring Vision-Language STEM Skills of Neural Models

Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, Chenguang Wang

TL;DR

This work introduces STEM, a large-scale multimodal benchmark that assesses vision-language STEM skills across Science, Technology, Engineering, and Mathematics for learners from pre-K to 8th grade, comprising 448 skills and 1,073,146 questions. By evaluating a broad set of foundation models (e.g., CLIP, GPT-3.5-Turbo) and language models under zero-shot and finetuning/few-shot regimes, the study reveals that current models achieve only modest gains and remain far below average elementary students, with notable difficulty in math and abstract reasoning. The paper provides a rich meta-information framework (skills, grades, subjects) enabling fine-grained analyses, calibration studies, and scaling-law insights, and demonstrates substantial gains from finetuning yet emphasizes that scale alone cannot close the human gap. Collectively, STEM serves as a challenging testbed that motivates novel algorithmic innovations to enable robust multimodal STEM problem solving in real-world settings.

Abstract

We introduce a new challenge to test the STEM skills of neural models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset requires understanding multimodal vision-language information in STEM. It is one of the largest and most comprehensive datasets for this challenge, with 448 skills and 1,073,146 questions spanning all STEM subjects. Whereas existing datasets often focus on examining expert-level ability, ours includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that recent model advances help master only a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, let alone near expert-level performance. To understand and improve performance on our dataset, we train the models on a training split of our dataset. Although performance improves, it remains relatively low compared to average elementary students. Solving STEM problems will require novel algorithmic innovations from the community.

Paper Structure (53 sections, 37 figures, 27 tables)

Figures (37)

  • Figure 2: A summary of skills.
  • Figure 3: Skill comparison between STEM and existing datasets (IconQA and ScienceQA).
  • Figure 5: Results categorized by sampled skills of each subject. M: math. S: science. T: technology. E: engineering. Full results are in the appendix.
  • Figure 6: Average grade-level exam scores.
  • Figure 7: CLIP calibration results.
  • ...and 32 more figures