A Psychology-based Unified Dynamic Framework for Curriculum Learning

Guangyu Meng, Qingkai Zeng, John P. Lalor, Hong Yu

TL;DR

PUDF presents a psychology-based unified dynamic framework for curriculum learning by combining IRT-based artificial crowds (IRT-AC) to label data difficulty with a model-ability guided data scheduler (DDS-MAE). The approach yields globally interpretable difficulty scores and dynamic, data-efficient training that accelerates convergence while improving accuracy across large language models and diverse tasks. Empirical results show PUDF outperforming standard fine-tuning and several state-of-the-art CL methods in both speed and predictive performance, with notable gains on large-scale datasets like AG News and challenging tasks like MedQA. The work demonstrates PUDF’s scalability, theoretical rigor, and potential to generalize to generative tasks, offering a principled path for adaptive curriculum design in NLP and beyond.

Abstract

Directly learning from examples of varying difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. Drawing inspiration from psychometrics, this paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF). We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a training strategy, Dynamic Data Selection via Model Ability Estimation (DDS-MAE), to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to aligned training data selection and faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained large language models with PUDF leads to higher accuracy and faster convergence on a suite of benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods. Ablation studies and downstream analyses further validate the impact of PUDF for CL.
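The interplay between IRT difficulty values and a model ability estimate can be illustrated with a minimal sketch. It rests on two assumptions that the abstract does not state: that the response model is the one-parameter logistic (Rasch) form, and that the scheduler keeps the examples whose difficulty does not exceed the current ability estimate. The function names `p_correct` and `select_examples`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Assumed 1PL (Rasch) form: probability that a model with ability
    `theta` labels an example of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def select_examples(difficulties: list[float], theta: float) -> list[int]:
    """Assumed selection rule: keep the indices of examples whose
    difficulty is at most the current ability estimate."""
    return [i for i, b in enumerate(difficulties) if b <= theta]


# Hypothetical difficulty values, standing in for IRT-AC output.
difficulties = [-1.5, -0.2, 0.0, 0.8, 2.0]

# Hypothetical ability estimates that grow as training progresses.
for theta in (-0.5, 0.5, 2.5):
    chosen = select_examples(difficulties, theta)
    share = len(chosen) / len(difficulties)
    print(f"theta={theta:+.1f}: selected {chosen} ({share:.0%} of data)")
```

Because difficulty and ability sit on the same scale in this sketch, the share of training data grows as the ability estimate rises, and an example whose difficulty equals the model's ability is answered correctly with probability 0.5.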

Paper Structure

This paper contains 42 sections, 6 equations, 9 figures, 18 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Plot of $p(z_{ij} = 1 | \theta_j, b_i)$ as a function of $\theta_j$ for two examples: (a) an example with difficulty $b_i=0$, and (b) a more difficult example ($b_i=2$). Models with ability $\theta_j > b_i$ (right of the dashed line) have a greater than 50% chance of labeling the example correctly.
  • Figure 2: Workflow of PUDF. The process consists of two main steps: 1) IRT-AC as the difficulty measurer (DM), and 2) DDS-MAE with LLM fine-tuning as the training scheduler (TS).
  • Figure 3: Comparison of training time between PUDF and other CL methods. All runtimes are reported in minutes. GLUE scores are reported as the mean across tasks, pooled by runs. $*$ indicates that the runtime is significantly longer than that of PUDF (Welch's one-tailed t-test with Benjamini-Hochberg correction, $\alpha < 0.05$).
  • Figure 4: Convergence analysis of the proposed PUDF against the Qwen2.5-7B baseline on AG News, MedQA, and the GLUE benchmark datasets. The solid lines represent validation accuracy, while the dotted lines indicate the percentage of training data utilized per epoch. Circular markers highlight the epoch with the best validation accuracy achieved by each model.
  • Figure 5: IRT-AC generated difficulty distributions for the GLUE benchmark, AG News, and MedQA datasets.
  • ...and 4 more figures