
Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Einari Vaaras, Manu Airaksinen, Okko Räsänen

Abstract

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
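As a minimal illustration of the farthest-first traversal (FAFT) baseline compared above, the following greedy sketch selects samples that are maximally spread out in feature space. This is a generic textbook formulation, not the authors' implementation; the function name and the Euclidean distance choice are assumptions for illustration.

```python
import numpy as np

def farthest_first_traversal(X, k, seed=0):
    """Greedily select k row indices of X, each new point maximizing
    its distance to the closest already-selected point."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]            # arbitrary starting point
    # distance of every point to its nearest selected point so far
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                  # farthest remaining point
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

Because each pick maximizes the distance to the current selection, FAFT tends to cover the extremes of the data distribution, which is why it can reach rare regions that random sampling misses.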

Paper Structure

This paper contains 22 sections, 3 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: An overview of the study: the overall study design, an example of different sampling strategies, and the model performance evaluation procedure.
  • Figure 2: An example screenshot of the TSExplorer GUI as used for posture annotation for IMA in the present study. On the top left of the GUI, there is a video widget for video playback, and a scatter plot visualizing the entire dataset is located on the right. On the bottom left, there are two plots visualizing the accelerometer (right) and gyroscope (left) signals from the MAIJU-DS dataset. From top to bottom, the panels show the x, y, and z components for the right arm, left arm, right leg, and left leg. Users can select samples from the scatter plot for annotation in any order, and the annotation can be performed either via a drop-down menu or through keyboard shortcuts. The color of each data point in the scatter plot reflects its current label (green denoting unlabeled samples), and the active sample is highlighted with a large yellow ring. The color code as well as the keyboard shortcut for each class can be seen in the scatter plot legend.
  • Figure 3: The combined label distributions (±SD) for all annotators in terms of IMA (top left: posture; top right: movement) and SER (bottom left: valence; bottom right: arousal). Each label-wise histogram group is organized from left to right in the following order: MAIJU-DS or NICU-A GS reference distribution, RND, FAFT, and 2DV.
  • Figure 4: The mean classification results for IMA when training models with annotator-wise labels separately, organized by annotator group (dashed line: expert annotators; dotted line: non-expert annotators) and by classification task (left: posture; right: movement). A topline performance level (black horizontal line), derived from a model trained using MAIJU-DS labels, is shown for comparison.
  • Figure 5: The mean classification results for SER when training models with annotator-wise labels separately, organized by annotator group (dashed line: expert annotators; dotted line: non-expert annotators) and by classification task (left: valence; right: arousal). A reference performance level (black horizontal line), derived from a model trained using the labeled subset of the NICU-A training set, is shown for comparison.
  • ...and 5 more figures