Beyond Search Engines: Can Large Language Models Improve Curriculum Development?

Mohammad Moein, Mohammadreza Molavi Hajiagha, Abdolali Faraji, Mohammadreza Tavakoli, Gàbor Kismihòk

TL;DR

The paper tackles the problem of keeping online curricula current by proposing a framework in which large language models generate relevant learning topics for a course from its title alone, with the output compared against YouTube-derived baselines. It builds a dataset from YouTube playlists across 25 learning areas, generates topic lists with GPT-4 and GPT-3.5 (sampling each model multiple times), and measures alignment with ground-truth topics using the BERTScore $F_1$ measure. GPT-4 achieves an $F_1$ of around $0.30$, slightly better than the YouTube baseline, while GPT-3.5 underperforms; the results indicate potential for LLM-assisted curriculum design and also expose limitations in recall. The work contributes a reproducible dataset and an evaluation framework for AI-assisted curriculum development, while acknowledging the limitations of relying on YouTube as ground truth and suggesting broader data sources for future work.

Abstract

While online learning is growing and becoming widespread, the associated curricula often suffer from a lack of coverage and outdated content. In this regard, a key question is how to dynamically define the topics that must be covered to thoroughly learn a subject (e.g., a course). Large Language Models (LLMs) are considered candidates for addressing curriculum development challenges. Therefore, we developed a framework and a novel dataset, built on YouTube, to evaluate LLMs' performance in generating learning topics for specific courses. The experiment was conducted across over 100 courses and nearly 7,000 YouTube playlists in various subject areas. Our results indicate that, in terms of BERTScore, GPT-4 can produce more accurate topics for the given courses than topics extracted from YouTube video playlists.
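To make the evaluation metric concrete, the following is a minimal sketch of the greedy-matching idea behind a BERTScore-style $F_1$, applied here to whole topics rather than tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `greedy_f1` are hypothetical, the three-dimensional vectors are toy stand-ins for real contextual embeddings, and the actual BERTScore configuration used in the paper is not given in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy best-match similarity.

    Precision averages, over generated items, the similarity to the closest
    reference item. Recall averages, over reference items, the similarity to
    the closest generated item. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumed for illustration only): three ground-truth topics,
# of which the generated list covers two.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
generated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

precision, recall, f1 = greedy_f1(generated, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every generated topic matches a ground-truth topic, so precision is perfect, but one ground-truth topic is never covered, which lowers recall and therefore $F_1$. This is the kind of recall limitation the TL;DR refers to.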

Paper Structure

This paper contains 4 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Data collection and evaluation pipeline: Ground truth building is illustrated in Parts 1-3; Part 4 shows the topic generation with GPT-4, GPT-3.5, and YouTube. Finally, model performance is assessed using BERTScore in Part 5.
  • Figure 2: Performance for each area. The graph shows the mean and 95% confidence interval. Areas where GPT-4 outperformed YouTube are shown in bold.