Diagnosing Medical Datasets with Training Dynamics

Laura Wenderoth

TL;DR

The evaluation indicates that the Data Maps framework is unsuitable for addressing the unique challenges of medical question-answering datasets.

Abstract

This study explores the potential of using training dynamics as an automated alternative to human annotation for evaluating the quality of training data. The framework used is Data Maps, which classifies data points into categories such as easy-to-learn, hard-to-learn, and ambiguous (Swayamdipta et al., 2020). Swayamdipta et al. (2020) highlight that hard-to-learn examples often contain errors and that ambiguous cases significantly impact model training. To test the reliability of these findings, we replicated the experiments on a challenging dataset for medical question answering. In addition to text comprehension, this task requires detailed medical knowledge, which makes it more difficult. A comprehensive evaluation was conducted to assess the feasibility and transferability of the Data Maps framework to the medical domain. The evaluation indicates that the framework is unsuitable for addressing the unique challenges of medical question-answering datasets.

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The ranking displays how well various models performed in answering medical questions, as evaluated against the MedQA benchmark. The ranking is determined by accuracy and offers a thorough assessment of each model's effectiveness in answering medical questions. Chart borrowed from Medqabenchmark.
  • Figure 2: Data Maps analysis on the MedQA dataset using the performance of the RoBERTa-large classifier over 20 epochs with 182,822 training samples. Only 25,000 samples are displayed for clarity. The x-axis indicates variability, ranging from low to high, while the y-axis represents the confidence levels of the classifier. The visualisation uses different colours and shapes to indicate correctness. Red triangles represent easy-to-learn examples with low variability and high confidence, blue circles represent hard-to-learn instances with low variability and low confidence, and black pluses represent ambiguous cases with high variability. This representation provides an overview of the dataset with respect to the classifier.
  • Figure 3: Data Maps analysis on the MedQA dataset using the performance of the RoBERTa-large classifier, as in Figure 2. The training dynamics were calculated for only five epochs instead of the full 20. This causes the data points to lie closer together, making them harder to distinguish.
  • Figure 4: RoBERTa-large model performance over training epochs. The plot indicates that the model is overfitting on the training data, resulting in no improvement on the validation data. The training accuracy increased rapidly during the training process, reaching a maximum of 87.7% in the 20th epoch. In contrast, the validation accuracy initially increased up to epoch 5, followed by fluctuation, and reached a maximum of 35.74% at epoch 19.
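
The two axes described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the definitions of Swayamdipta et al. (2020): confidence is the mean probability the model assigns to the gold label across epochs, and variability is the standard deviation of that probability. The function name `data_map_coordinates` and the toy probabilities are hypothetical and are not taken from the paper's code or results.

```python
from statistics import fmean, pstdev


def data_map_coordinates(gold_probs_per_epoch):
    """Return (confidence, variability) for one training example.

    gold_probs_per_epoch: probability assigned to the gold label,
    recorded once per training epoch (hypothetical input format).
    """
    # Confidence: mean gold-label probability across epochs (y-axis).
    confidence = fmean(gold_probs_per_epoch)
    # Variability: population standard deviation across epochs (x-axis).
    variability = pstdev(gold_probs_per_epoch)
    return confidence, variability


# Toy values for illustration only; not results from the paper.
easy = [0.90, 0.95, 0.97, 0.98]       # consistently high probability
hard = [0.05, 0.04, 0.06, 0.05]       # consistently low probability
ambiguous = [0.10, 0.90, 0.20, 0.80]  # fluctuates between epochs

for name, probs in [("easy", easy), ("hard", hard), ("ambiguous", ambiguous)]:
    conf, var = data_map_coordinates(probs)
    print(f"{name}: confidence={conf:.2f}, variability={var:.2f}")
```

In this sketch the easy and hard examples both have low variability and differ only in confidence, while the ambiguous example is set apart by its high variability, matching the three regions named in the caption.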