Does the Definition of Difficulty Matter? Scoring Functions and their Role for Curriculum Learning

Simon Rampp, Manuel Milling, Andreas Triantafyllopoulos, Björn W. Schuller

TL;DR

An extensive study of the robustness and similarity of the most common scoring functions for sample difficulty estimation, and of their potential benefits in curriculum learning (CL), using the popular benchmark dataset CIFAR-10 and the acoustic scene classification task from the DCASE2020 challenge as representatives of computer vision and computer audition, respectively.

Abstract

Curriculum learning (CL) describes a machine learning training strategy in which samples are gradually introduced into the training process based on their difficulty. Despite a partially contradictory body of evidence in the literature, CL finds popularity in deep learning research due to its promise of leveraging human-inspired curricula to achieve higher model performance. Yet, the subjectivity and biases that follow any necessary definition of difficulty, especially for those found in orderings derived from models or training statistics, have rarely been investigated. To shed more light on the underlying unanswered questions, we conduct an extensive study on the robustness and similarity of the most common scoring functions for sample difficulty estimation, as well as their potential benefits in CL, using the popular benchmark dataset CIFAR-10 and the acoustic scene classification task from the DCASE2020 challenge as representatives of computer vision and computer audition, respectively. We report a strong dependence of scoring functions on the training setting, including randomness, which can partly be mitigated through ensemble scoring. While we do not find a general advantage of CL over uniform sampling, we observe that the ordering in which data is presented for CL-based training plays an important role in model performance. Furthermore, we find that the robustness of scoring functions across random seeds positively correlates with CL performance. Finally, we uncover that models trained with different CL strategies complement each other by boosting predictive power through late fusion, likely due to differences in the learnt concepts. Alongside our findings, we release the aucurriculum toolkit (https://github.com/autrainer/aucurriculum), implementing sample difficulty and CL-based training in a modular fashion.
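The ensemble-scoring idea mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or from the aucurriculum toolkit: the helper names (`rank`, `spearman`, `ensemble_scores`, `noisy_scores`), the rank-averaging rule used to build an ensemble, and the synthetic "latent difficulty plus seed noise" data. The sketch only shows why agreement between orderings, measured by Spearman correlation, can be expected to rise when scores from several seeds are aggregated.

```python
import numpy as np

def rank(values):
    """Return 0-based ranks (ties broken by position; fine for continuous scores)."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def ensemble_scores(per_seed_scores):
    """Hypothetical aggregation rule: average the per-seed ranks of each sample."""
    ranks = np.stack([rank(s) for s in per_seed_scores])
    return ranks.mean(axis=0)

rng = np.random.default_rng(0)
n_samples = 2000
# Synthetic stand-in for a "true" difficulty that every seed observes with noise.
true_difficulty = rng.normal(size=n_samples)

def noisy_scores(n_seeds):
    """Simulate difficulty scores from n_seeds independent training runs."""
    noise = rng.normal(scale=2.0, size=(n_seeds, n_samples))
    return true_difficulty + noise

# Agreement between two single-seed orderings.
single_corr = spearman(noisy_scores(1)[0], noisy_scores(1)[0])

# Agreement between two independent ensembles of six seeds each.
ensemble_corr = spearman(
    ensemble_scores(noisy_scores(6)),
    ensemble_scores(noisy_scores(6)),
)

print(f"single-seed agreement: {single_corr:.2f}")
print(f"ensemble-of-6 agreement: {ensemble_corr:.2f}")
```

In this toy setting the two ensembles agree more closely than the two single-seed orderings, because averaging suppresses the seed-specific noise. Whether and how strongly this holds for a real scoring function is an empirical question.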

Paper Structure

This paper contains 23 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Correlation of scoring functions (SFs) with increasing ensemble sizes for the datasets CIFAR-10 and DCASE2020. The ensemble size is the number of individual orderings, each obtained from a different random seed, that are combined to build one ensemble. For each SF, we report pairwise Spearman correlations across three ensemble orderings with the same ensemble size.
  • Figure 2: Mean pacing function (PF) performance on CIFAR-10 and DCASE2020, averaged across scoring functions, saturation fractions, and three seeds. Each bar represents the average performance for one PF and curriculum ordering. The grey dashed vertical lines indicate the baseline performance, averaged across the 15 random seeds.
  • Figure 3: Comparison of scoring function (SF) robustness and curriculum learning (CL) performance for CIFAR-10 and DCASE2020. For each SF, we evaluate ensemble orderings of different ensemble sizes, noted as a number next to each point. The $y$-axis represents the pairwise correlation across the ensembles of the respective ensemble size (cf. $x$-axis in Figure 1) as an indicator of SF robustness. The $x$-axis displays the average accuracy of CL experiments based on the corresponding ensemble orderings. Coloured and grey dashed lines are linear least-squares fits per SF and across all SFs, respectively. The slope of the lines indicates whether higher SF robustness tends to coincide with higher CL performance.
  • Figure 4: Late fusion results for combinations of curriculum learning (CL), random curriculum learning (RCL), and anti-curriculum learning (ACL), as well as the best baselines, abbreviated with B; for instance, B4 represents the fusion of the best 4 baseline runs.
  • Figure 5: Agreement of different scoring functions with varying random seeds. Displayed is the pairwise Spearman correlation of the respective ensemble orderings of ensemble size six. The individual orderings building up the ensemble are obtained from the reference configuration and five additional variations of the random seed.
  • ...and 2 more figures
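
As a rough illustration of what late fusion means in the Figure 4 caption, the sketch below averages the class probabilities of two models and takes the argmax. This is one common fusion rule and is assumed here purely for illustration; the source does not state which rule the authors use. The function name `late_fusion` and the toy probability arrays are hypothetical.

```python
import numpy as np

def late_fusion(prob_list):
    """Average per-class probabilities over models, then pick the top class."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return mean_probs.argmax(axis=1)

# Toy predictions for 3 samples and 2 classes (made-up numbers).
cl_probs = np.array([[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]])   # e.g. a CL-trained model
acl_probs = np.array([[0.60, 0.40], [0.70, 0.30], [0.20, 0.80]])  # e.g. an ACL-trained model

fused = late_fusion([cl_probs, acl_probs])
print(fused.tolist())  # [0, 0, 1]
```

On the second sample the two models disagree, and the fused prediction follows the more confident one. This is the mechanism by which models that have learnt different concepts can complement each other.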