OpenStaxQA: A multilingual dataset based on open-source college textbooks
Pranav Gupta
TL;DR
OpenStaxQA addresses the shortage of high-quality multilingual, college-level STEM datasets by constructing 18,332 end-of-chapter problem-solution pairs from 43 OpenStax textbooks in English, Spanish, and Polish. The paper finetunes ~7B-parameter LLMs using quantized low-rank adapters (QLoRA) and evaluates them on OpenStaxQA and, in a zero-shot setting, on AI2RC; it also uses GPT-4 as an oracle for rating responses and releases the scraping code and data. Finetuned models outperform baselines on OpenStaxQA and show improved zero-shot performance on AI2RC, though the results are affected by data quality and language distribution. The work demonstrates the feasibility of converting open textbook content into scalable NLP datasets, promotes standardization of scraping and formatting workflows, and lays groundwork for expanding multilingual, domain-specific educational tools while addressing ethical and accessibility considerations.
Abstract
We present OpenStaxQA, an evaluation benchmark for college-level educational applications, built from 43 open-source college textbooks in English, Spanish, and Polish that are available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low-rank adapters (QLoRA). We also perform a zero-shot evaluation on the AI2 Reasoning Challenge dev dataset to check whether finetuning on OpenStaxQA improves performance on other tasks. Finally, we discuss broader impacts relevant to datasets such as OpenStaxQA.
