LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ

Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche

TL;DR

LLaMa-SciQ tackles the difficulty of mathematical reasoning in STEM MCQs by fine-tuning and aligning LLaMa-3-8B on STEM-focused datasets (StemQA, StemDPO, StemMCQ) and evaluating the model on the GSM8K, MATH, and EPFL MCQ benchmarks. The approach combines supervised fine-tuning (SFT) with Direct Preference Optimization (DPO), plus an optional Retrieval-Augmented Generation (RAG) pipeline and a 4-bit quantization step to improve accessibility. The model reaches $74.5\%$ accuracy on GSM8K and $30\%$ on MATH. RAG does not improve accuracy and can even reduce it, while quantization costs only about $5\%$ in performance, enabling faster, cheaper deployment. The work contributes specialized datasets and an end-to-end pipeline that balances accuracy and efficiency for a student-focused STEM MCQ assistant, with future directions toward better prompting, multilingual support, and bias mitigation.

Abstract

Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs). To address this issue, we developed LLaMa-SciQ, an educational chatbot designed to assist college students in solving and understanding MCQs in STEM fields. We begin by fine-tuning and aligning the models to human preferences. After comparing the performance of Mistral-7B and LLaMa-3-8B, we selected the latter as the base model due to its higher evaluation accuracy. To further enhance accuracy, we implement Retrieval-Augmented Generation (RAG), and we apply quantization to compress the model, reducing inference time and increasing accessibility for students. For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8K dataset and 30% on the MATH dataset. However, RAG does not improve performance and even reduces it, likely due to retriever issues or the model's unfamiliarity with context. Despite this, the quantized model loses only 5% in performance while offering significant efficiency improvements.

Paper Structure

This paper contains 42 sections, 2 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: The Training Pipeline: Organized into three consecutive stages: Supervised Fine-Tuning, Direct Preference Optimization Training, and Multiple Choice Question Answering Specialization.
  • Figure 2: The RAG Pipeline
  • Figure 3: MCQ-SFT Training Loss
  • Figure 4: SFT Training statistics for Llama and Mistral models on 100,000 samples.
  • Figure 5: Training Analytics: Transformers Models