Filtering out mislabeled training instances using black-box optimization and quantum annealing

Makoto Otsuka, Kento Kodama, Keisuke Morita, Masayuki Ohzeki

TL;DR

This work removes mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization with postprocessing and quantum annealing. Its effectiveness is demonstrated on a supervised learning task; future directions include application to unsupervised learning, real-world datasets, and large-scale implementations.

Abstract

This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
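The black-box objective described above, the validation loss of a model trained on a filtered subset, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the one-parameter logistic model, the helper names (`fit_weight`, `log_loss`, `validation_loss`), the training hyperparameters, and the choice to build the mislabeled half by reusing the clean input patterns with flipped labels. The surrogate model and the annealing sampler are not shown; the sketch only evaluates two fixed inclusion masks.

```python
import itertools
import math
import random

THRESHOLD = 4.5  # decision boundary for the sum of 9 input bits


def majority_bit(x):
    """Return 1 if more than half of the 9 bits are ones, else 0."""
    return int(sum(x) > THRESHOLD)


def fit_weight(instances, steps=200, lr=0.1):
    """Fit p(t=1|x) = sigmoid(w * (sum(x) - 4.5)) by plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x, t in instances:
            z = sum(x) - THRESHOLD
            p = 1.0 / (1.0 + math.exp(-w * z))
            grad += (p - t) * z
        w -= lr * grad / len(instances)
    return w


def log_loss(w, instances):
    """Mean log-loss of the one-parameter model on labeled instances."""
    total = 0.0
    for x, t in instances:
        z = sum(x) - THRESHOLD
        p = 1.0 / (1.0 + math.exp(-w * z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(instances)


def validation_loss(mask, train, valid):
    """Black-box objective: train on instances with mask bit 1, score on valid."""
    subset = [inst for keep, inst in zip(mask, train) if keep]
    if not subset:
        return math.log(2.0)  # an untrained model predicts 0.5 everywhere
    return log_loss(fit_weight(subset), valid)


# Toy data: non-overlapping 9-bit patterns for training and validation.
rng = random.Random(0)
patterns = list(itertools.product([0, 1], repeat=9))
rng.shuffle(patterns)
train_x, valid_x = patterns[:64], patterns[64:192]

clean = [(x, majority_bit(x)) for x in train_x]
# Assumption: mislabeled instances reuse the same patterns with flipped labels.
fake = [(x, 1 - majority_bit(x)) for x in train_x]
train = clean + fake
valid = [(x, majority_bit(x)) for x in valid_x]

keep_all = [1] * 128
keep_clean = [1] * 64 + [0] * 64
loss_all = validation_loss(keep_all, train, valid)
loss_clean = validation_loss(keep_clean, train, valid)
print(loss_clean < loss_all)
```

In the proposed method, a mask such as `keep_clean` is not known in advance. Candidate masks would instead be proposed by a sampler guided by a surrogate of `validation_loss`, with each evaluation used to refine that surrogate.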

Paper Structure (17 sections, 1 equation, 9 figures, 1 algorithm)

Figures (9)

  • Figure 1: Training, validation, and test datasets for the noisy majority bit task. In each dataset, every column represents one instance composed of 9 input bits $\mathbf{x} \in \{0, 1\}^{9}$ and 1 target bit $t \in \{0, 1\}$, which are separated by the horizontal dashed blue line. To create the input patterns for the training, validation, and test sets, 64, 128, and 128 non-overlapping binary patterns, respectively, were randomly sampled without repetition. Each input pattern in the validation and test sets was correctly labeled with the majority bit of its 9 input bits. The training dataset consists of two halves, separated by a vertical dashed red line. The first half contains 64 real training instances correctly labeled with the majority bit, while the second half contains 64 fake training instances incorrectly labeled with the minority bit. These incorrectly labeled fake training instances should be removed for successful training.
  • Figure 2: Instance removal patterns acquired by three different samplers over 32 runs. Each cell in the grid represents the inclusion or exclusion status of a training instance in a specific run. For each sampler, a white cell at the $r$-th row and $i$-th column indicates that the $i$-th training instance is included in the filtered training set and used for training in the $r$-th run. A black cell indicates that the instance is excluded from the filtered training set and not used for training in that run. The red dotted line marks the border between clean and mislabeled training instances. Seed values range from 0 to 31. The consistent filtering pattern across runs provides clear evidence that the proposed algorithm can reliably identify mislabeled instances while preserving clean instances.
  • Figure 3: Performance of the trained model on four different datasets. In each run, a model was trained on a filtered training dataset and then evaluated on four different datasets. Model performance was measured using the log-loss metric. Each gray dot represents one of the 32 log-loss errors for a specific dataset. Variations in error values within the same dataset are due to differences in the training instances selected across runs by the dataset optimization algorithm. The labels Train (all) and Train (optimized) indicate whether all instances or only the filtered instances in the training dataset were used for model evaluation, respectively. The whiskers extend to the farthest points within 1.5 times the inter-quartile range from the nearest hinge of the box plot. The two box plots on the left confirm that training errors are consistently reduced after optimizing the training dataset in all 32 runs. The two box plots on the right further indicate that test errors remain low and comparable to the directly optimized validation errors, offering strong evidence that training with the optimized dataset enhances generalization performance.
  • Figure 4: Characteristics of the input patterns of the fake training instances and their removal probabilities using the D-Wave QA clique sampler. 1) The 9-dimensional binary input patterns of the fake training instances. 2) The summed values of the input patterns, with the horizontal dashed red line marking the optimal decision boundary of 4.5. 3) The absolute deviance (or distance) of the summed input patterns from the optimal decision boundary of 4.5. 4) The removal probabilities of the incorrectly labeled fake instances over 32 runs with seed values ranging from 0 to 31. 5) The removal patterns of the fake training instances across the 32 runs. This removal pattern is also depicted on the right side of the dashed red line in the D-Wave QA clique result shown in Fig. 2. The relationship between the characteristics of the input patterns and their removal probabilities is further analyzed in Fig. 5, highlighting how input features influence the likelihood of removal.
  • Figure 5: Relationship between the removal probabilities and the characteristics of the 64 input patterns of the fake training subset. The removal probabilities are compared with (a) the number of ones in each input pattern of the fake training subset and (b) the absolute deviance of the summed input values from the optimal threshold of 4.5. In both plots, the whiskers extend to the farthest points within 1.5 times the inter-quartile range from the nearest hinge of the box plot. Both plots clearly illustrate that mislabeled training instances with a greater detrimental impact on the trained model (for example, a zero vector mislabeled as class 1) are more likely to be removed.
  • ...and 4 more figures
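
The quantities named in the captions of Figures 4 and 5 (the number of ones, the absolute deviance from the boundary of 4.5, and the per-instance removal probability over runs) can be computed as in the following minimal sketch. The function names are hypothetical, and `toy_removed` is made-up illustrative data, not a result from the paper.

```python
THRESHOLD = 4.5  # optimal decision boundary for the sum of 9 input bits


def majority_bit(x):
    """Correct label: 1 if more than half of the 9 bits are ones."""
    return int(sum(x) > THRESHOLD)


def minority_bit(x):
    """Incorrect label assigned to a fake training instance."""
    return 1 - majority_bit(x)


def abs_deviance(x):
    """Distance of the summed input bits from the decision boundary."""
    return abs(sum(x) - THRESHOLD)


def removal_probabilities(removed):
    """removed[r][i] is 1 if instance i was excluded in run r, else 0."""
    n_runs = len(removed)
    n_instances = len(removed[0])
    return [sum(run[i] for run in removed) / n_runs for i in range(n_instances)]


# The example from the Figure 5 caption: a zero vector mislabeled as class 1.
zero_vector = (0,) * 9
print(minority_bit(zero_vector), abs_deviance(zero_vector))  # prints: 1 4.5

# Hypothetical removal patterns for 3 instances over 4 runs (illustration only).
toy_removed = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
probs = removal_probabilities(toy_removed)
print(probs)  # prints: [1.0, 0.25, 0.5]
```

With real removal patterns such as those in Figure 2, `removed` would have 32 rows (one per seed) and one column per fake training instance.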