Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs
Bianca Raimondi, Saverio Giallorenzo, Maurizio Gabbrielli
TL;DR
This study investigates how affordably and effectively open-source LLMs can answer course-specific MCQs in Programming Languages. It compares generic LLaMA-2 variants (7B, 13B, 70B) in zero-shot inference, with and without 4-bit quantisation, and shows that small models fine-tuned on textbook material can match or exceed the accuracy of larger pre-trained models under practical hardware constraints. The authors use LoRA and qLoRA for efficiency, fine-tune on a chapter-based subset of a PL textbook, and analyze how the choice of fine-tuning dataset, quantisation, and hyperparameters influence performance. They find that domain-specific fine-tuning on Structuring Data (SD) material significantly boosts accuracy on SD MCQs, to around 70% or more in some configurations. The work highlights a viable path to affordable, classroom-friendly LLM deployment, while acknowledging limitations such as the narrow domain scope, potential catastrophic forgetting in smaller models, and the need for broader multi-domain validation and possible multimodal extensions.
Abstract
In education, the ability of Large Language Models (LLMs) to generate human-like text has inspired work on how they can make learning and teaching more efficient. We study the affordability of these models for educators and students by investigating how LLMs answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques. We explore this space by using generic pre-trained LLMs (the 7B, 13B, and 70B variants of LLaMA-2) to answer 162 undergraduate-level MCQs from a course on Programming Languages (PL); the MCQ dataset is a contribution of this work, which we make publicly available. Specifically, we dissect how different factors, such as fine-tuning on readily available material, namely (parts of) the course's textbook, and quantisation (to decrease resource usage), change the accuracy of the responses. The main takeaway is that smaller textbook-based fine-tuned models outperform generic larger ones (whose pre-training requires considerable resources), making the use of LLMs for answering MCQs affordable in terms of both resources and material.
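To make the hardware-constraint angle concrete, the following is a minimal back-of-the-envelope sketch of how quantisation changes the memory needed just to hold model weights. It is not taken from the paper: the function name `weight_memory_gib` and the nominal parameter counts are assumptions made for illustration, and the estimate ignores activations, the KV cache, and quantisation metadata.

```python
# Rough weight-memory estimate: parameters * bits per parameter / 8 bytes.
# Assumption: only the weights are counted; activations, KV cache, and
# quantisation metadata (scales, zero points) are ignored.

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold model weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Nominal parameter counts for the LLaMA-2 variants named in the abstract
# (hypothetical round figures used only for this estimate).
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

for name, num_params in MODEL_SIZES.items():
    fp16 = weight_memory_gib(num_params, 16)
    int4 = weight_memory_gib(num_params, 4)
    print(f"LLaMA-2 {name}: ~{fp16:.1f} GiB at 16-bit, ~{int4:.1f} GiB at 4-bit")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is the kind of saving that determines whether a given variant fits on commodity hardware.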
