OpenStaxQA: A multilingual dataset based on open-source college textbooks

Pranav Gupta

TL;DR

OpenStaxQA addresses the shortage of high-quality multilingual, college-level STEM datasets by constructing 18,332 end-of-chapter problem-solution pairs from 43 OpenStax textbooks in English, Spanish, and Polish. The authors finetune ~7B-parameter LLMs using quantized low-rank adapters (QLoRA) and evaluate them on OpenStaxQA and, in a zero-shot setting, on AI2RC; they also use GPT-4 as an oracle for rating responses and release the scraping code and data. Results show that the finetuned models outperform baselines on OpenStaxQA and exhibit improved zero-shot performance on AI2RC, though the results are affected by data quality and language distribution. The work highlights the feasibility of converting open textbook content into scalable NLP datasets, promotes standardization of scraping and formatting workflows, and lays groundwork for expanding multilingual, domain-specific educational tools while addressing ethical and accessibility considerations.

Abstract

We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications, based on 43 open-source college textbooks in English, Spanish, and Polish that are available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low-rank adapters (QLoRA). We also perform a zero-shot evaluation on the AI2 Reasoning Challenge dev dataset to check whether finetuning on OpenStaxQA improves performance on other tasks. Finally, we discuss broader impacts relevant to datasets such as OpenStaxQA.

Paper Structure

This paper contains 10 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Workflow for scraping OpenStax question-answer pairs
  • Figure 2: Language distribution in the OpenStaxQA dataset
  • Figure 3: Distribution of fields of study in the OpenStaxQA dataset
  • Figure 4: Finetuning results on the OpenStaxQA dataset. Comparing Llama 7B untrained, Llama 7B finetuned, and Llemma 7B finetuned. Bar heights normalized for unequal sample sizes.
  • Figure 5: Finetuning results on the AI2RC dev dataset. Comparing Llama 7B untrained, Llama 7B finetuned, and Llemma 7B finetuned. Bar heights normalized for unequal sample sizes.
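
The captions for Figures 4 and 5 note that bar heights are normalized for unequal sample sizes. The summary does not say how this was done; one common approach, sketched below as an assumption, divides each rating's count by the total number of rated responses for that model, so that models evaluated on different numbers of samples can be compared on the same axis. The function name and the ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def normalized_heights(ratings):
    """Return the fraction of responses at each rating value.

    Assumption: "normalized" means count / total per model; the paper
    summary does not specify the exact normalization used.
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rating: counts[rating] / total for rating in sorted(counts)}


# Hypothetical ratings for two models with different sample sizes
# (illustrative values only, not data from the paper).
model_a = [1, 1, 2, 3, 3, 3, 5, 5]
model_b = [1, 3, 3, 5]

heights_a = normalized_heights(model_a)
heights_b = normalized_heights(model_b)

for rating in sorted(set(heights_a) | set(heights_b)):
    print(rating, heights_a.get(rating, 0.0), heights_b.get(rating, 0.0))
```

With this normalization, the bar heights for each model sum to 1 regardless of how many responses were rated.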