Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course

Sebastian Kahl, Felix Löffler, Martin Maciol, Fabian Ridder, Marius Schmitz, Jennifer Spanagel, Jens Wienkamp, Christopher Burgahn, Malte Schilling

TL;DR

This work examines how advanced LLM techniques can enable AI tutors for a robotics course, addressing hallucinations and evaluation challenges. It compares prompt engineering, Retrieval-Augmented Generation (RAG), and fine-tuning using GPT-3.5 and LLaMA-2-13B on a robotics QA task with 478 test chats and 2791 training chats, grounded in lecture material. The results show that RAG plus prompt engineering substantially improves factuality and perceived usefulness, while small fine-tuned models can approach or exceed larger baselines when used without RAG, though combining RAG with fine-tuning can cause overfitting. The study highlights metric correlations and biases toward brevity, underscoring the need for robust, domain-specific evaluation frameworks to guide AI-powered tutoring in education and to inform deployment sequencing.

Abstract

This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course. In particular, it applies several advanced techniques: prompt engineering, Retrieval-Augmented Generation (RAG), and fine-tuning. We assessed the different models and applied techniques using common similarity metrics like BLEU-4, ROUGE, and BERTScore, complemented by a small human evaluation of helpfulness and trustworthiness. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers. In the context of education, RAG appears to be an ideal technique because it enriches the input of the model with additional information and material that is usually already available for a university course. Fine-tuning, on the other hand, can produce small yet strong expert models, but poses the danger of overfitting. Our study further asks how we measure the performance of LLMs and how well current measurements represent correctness or relevance. We find high correlation among similarity metrics and a bias of most of these metrics towards shorter responses. Overall, our research points to both the potential and the challenges of integrating LLMs in educational settings, suggesting a need for balanced training approaches and advanced evaluation frameworks.
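
To make the remark about overlap metrics and response length concrete, the following is a minimal sketch of clipped n-gram precision and recall. It is a simplification and not the BLEU-4 or ROUGE implementation used in the study: full BLEU combines several n-gram orders and a brevity penalty, and the function names and example sentences here are hypothetical. It only illustrates one way in which a precision-oriented overlap score can favor a short answer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Simplified illustration only: no brevity penalty, a single n-gram order.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum((cand & ref).values())  # clipped match counts
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical reference answer and two candidate answers.
reference = "the jacobian maps joint velocities to end effector velocities"
short_answer = "the jacobian maps joint velocities"
long_answer = reference + " and it depends on the current joint configuration"

short_p, short_r = overlap_scores(short_answer, reference)
long_p, long_r = overlap_scores(long_answer, reference)
```

In this toy case the short answer reaches a precision of 1.0 with incomplete recall, while the longer answer, which contains the whole reference plus extra detail, has full recall but lower precision.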

Paper Structure (17 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Overview of the process for integrating RAG into the query to an LLM: The user asks a question (1) through an interface. For the embedded question, additional information is retrieved from a vector database (2). The query to the LLM (3) integrates the originally posed question and the retrieved information into a specific (usually engineered) prompt. The LLM generates a completion as a response (4). Often, the reply has to be unwrapped before it can be passed to the user (5).
  • Figure 2: Overview of the query process when using a fine-tuned model (the access process is identical to directly querying a given LLM, but can of course be extended to include RAG as well): The user asks a question (1) through an interface, which is given to the LLM (2). The LLM generates a completion as a response (3), which is passed to the user (4). The important difference is that task-specific knowledge or interaction-specific patterns are trained into the model beforehand, i.e., during the fine-tuning stage (shown on the right in red).
  • Figure 3: Evaluation of Large Language Models: (a) BLEU-4 score which measures the precision of n-grams (here 4-grams) in the generated text compared to the ground truth text, while (b) ROUGE evaluates recall, measuring the overlap of n-grams between the generated and reference texts.
  • Figure 4: BERTScore as an Evaluation Metric for Large Language Models, which uses BERT embeddings to compute semantic similarity.
  • Figure 5: Correlation Matrix of Evaluation Metrics: The color-coded matrix represents correlations between evaluation metrics used to assess the performance of LLMs (measured on our test set). Each cell in the matrix indicates the correlation coefficient between two metrics, with color intensity (from blue to red) reflecting the strength of the correlation. Metrics include traditional similarity scores like BLEU and ROUGE, as well as newer ones like BERTScore, and their relationships with model output features, e.g., token count and human evaluation scores on trustworthiness or helpfulness.
  • ...and 3 more figures
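
The five steps described in the caption of Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list `LECTURE_CHUNKS` stands in for a vector database, `fake_llm` is a placeholder for the actual model call, and the prompt wording is invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical lecture snippets standing in for a vector database.
LECTURE_CHUNKS = [
    "Inverse kinematics computes joint angles from a desired end-effector pose.",
    "A PID controller combines proportional, integral and derivative terms.",
    "Forward kinematics computes the end-effector pose from joint angles.",
]


def embed(text):
    """Toy bag-of-words embedding (assumption: replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Step 2: return the k chunks most similar to the embedded question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context):
    """Step 3: merge the question and retrieved material into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "You are a tutor for a robotics lecture. Answer using only the context.\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )


def fake_llm(prompt):
    """Step 4 placeholder: a real system would call the LLM here."""
    return {"completion": "  placeholder answer  "}


def answer(question, chunks=LECTURE_CHUNKS, llm=fake_llm):
    """Steps 1 to 5: question in, unwrapped completion out."""
    context = retrieve(question, chunks)
    prompt = build_prompt(question, context)
    raw = llm(prompt)
    return raw["completion"].strip()  # Step 5: unwrap the reply
```

Calling `answer("What does inverse kinematics compute?")` runs the whole chain; with a real model, only `embed`, the chunk store, and `fake_llm` would need to be replaced.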