Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering

Yushi Yang, Andrew M. Bean, Robert McCraith, Adam Mahdi

TL;DR

LLM-defined question difficulty outperforms human-defined labels in curriculum-based learning, showing the potential of model-generated data as a cost-effective alternative for optimising fine-tuning.

Abstract

Fine-tuning Large Language Models (LLMs) incurs considerable training costs, driving the need for data-efficient training with optimised data ordering. Human-inspired strategies offer a solution by organising data based on human learning practices. This study evaluates the fine-tuning efficiency of five human-inspired strategies across four language models, three datasets, and both human- and LLM-labelled data in the context of medical question answering. These strategies achieve a best accuracy gain of 1.81% and an average gain of 1.02% across datasets, with interleaved strategies delivering the best average results. However, the best strategy varies across model-dataset combinations, limiting the generalisability of the effects of any single strategy. Additionally, LLM-defined question difficulty outperforms human-defined labels in curriculum-based learning, showing the potential of model-generated data as a cost-effective alternative for optimising fine-tuning.

Paper Structure (25 sections, 1 equation, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Human-inspired learning strategies. Alongside the Random Shuffle baseline, five human-inspired learning strategies order data based on question difficulty (shown with arrows) and question category (represented by block colors). The first row shows non-curriculum-based strategies, while the second row shows curriculum-based strategies.
  • Figure 2: Comparison of accuracy gains from learning strategies with human- and LLM-defined difficulty. (a) shows the highest and average accuracy gains (in %) of the best strategy across models and datasets, compared to the Random Shuffle baseline. (b) shows that using LLM-defined difficulty improves accuracy scores for all curriculum-based strategies, with each bar showing the mean accuracy gains (in %) across all model-data combinations.
  • Figure 3: Mean accuracy gains for the learning strategies. Each bar plot shows the mean accuracy gains (in %) of the learning strategies when averaged across model-dataset combinations. Figures (a)-(c) represent three data labelling scenarios.
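
The captions above describe orderings driven by question difficulty and question category, measured against a Random Shuffle baseline. The sketch below illustrates that idea on toy data only. It is not the authors' implementation: the record fields (`id`, `category`, `difficulty`), the function names, and the two orderings shown (an easy-to-hard sort and a round-robin over categories) are assumptions made for illustration, and the accuracy gain is assumed to be a simple difference from the baseline accuracy.

```python
import random
from collections import defaultdict
from itertools import zip_longest

# Hypothetical toy records; the field names and values are illustrative only.
questions = [
    {"id": 1, "category": "cardiology", "difficulty": 0.2},
    {"id": 2, "category": "cardiology", "difficulty": 0.8},
    {"id": 3, "category": "neurology", "difficulty": 0.5},
    {"id": 4, "category": "neurology", "difficulty": 0.1},
    {"id": 5, "category": "oncology", "difficulty": 0.9},
    {"id": 6, "category": "oncology", "difficulty": 0.4},
]


def random_shuffle(items, seed=0):
    """Baseline: a seeded random permutation of the training examples."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def curriculum_order(items):
    """Curriculum-style ordering (assumed form): easiest questions first."""
    return sorted(items, key=lambda q: q["difficulty"])


def interleaved_order(items):
    """Interleaved ordering (assumed form): cycle through the categories."""
    by_category = defaultdict(list)
    for q in items:
        by_category[q["category"]].append(q)
    ordered = []
    # zip_longest pads shorter categories with None, which is skipped.
    for group in zip_longest(*by_category.values()):
        ordered.extend(q for q in group if q is not None)
    return ordered


def accuracy_gain(strategy_accuracy, baseline_accuracy):
    """Gain over the Random Shuffle baseline, assumed to be a plain difference."""
    return strategy_accuracy - baseline_accuracy


if __name__ == "__main__":
    print([q["id"] for q in random_shuffle(questions)])
    print([q["id"] for q in curriculum_order(questions)])
    print([q["id"] for q in interleaved_order(questions)])
```

In this sketch the difficulty score can come from any source, so swapping human-assigned labels for LLM-derived ones changes only the `difficulty` values and not the ordering code.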