Realistic Evaluation of Deep Partial-Label Learning Algorithms

Wei Wang, Dong-Dong Wu, Jindong Wang, Gang Niu, Min-Ling Zhang, Masashi Sugiyama

TL;DR

This work tackles the reproducibility gap in deep partial-label learning (PLL) by introducing PLENCH, a standardized benchmark that includes novel model-selection criteria with theoretical guarantees, and PLCIFAR10, a realistic, human-annotated image dataset. It shows that selecting hyperparameters with covering rate (CR), approximated accuracy (AA), or oracle accuracy (OA) can yield different outcomes, and that no single algorithm uniformly dominates across diverse real-world settings. The study demonstrates that simple, well-tuned methods can rival more complex, resource-intensive approaches, and emphasizes the need for realistic data and consistent evaluation protocols in PLL. Together, PLENCH and PLCIFAR10 offer a practical foundation for fair comparisons and progress toward robust PLL methods in real-world scenarios.

Abstract

Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels and only one is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early developed algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that can be compatible with modern network architectures. Based on these findings, we propose PLENCH, the first Partial-Label learning bENCHmark to systematically compare state-of-the-art deep PLL algorithms. We investigate the model selection problem for PLL for the first time, and propose novel model selection criteria with theoretical guarantees. We also create Partial-Label CIFAR-10 (PLCIFAR10), an image dataset of human-annotated partial labels collected from Amazon Mechanical Turk, to provide a testbed for evaluating the performance of PLL algorithms in more realistic scenarios. Researchers can quickly and conveniently perform a comprehensive and fair evaluation and verify the effectiveness of newly developed algorithms based on PLENCH. We hope that PLENCH will facilitate standardized, fair, and practical evaluation of PLL algorithms in the future.

Paper Structure

This paper contains 29 sections, 4 theorems, 12 equations, 4 figures, 16 tables.

Key Result

Proposition 1

Suppose that there is a constant $\epsilon\in\left(0,1\right)$ such that the expected accuracy of a classifier $\bm{f}$ satisfies ${\rm ACC}\left(\bm{f}\right)\geq\epsilon$. Then, we have $\mathbb{E}\left[{\rm CR}(\bm{f})\right]-{\rm ACC}\left(\bm{f}\right)\leq(1-\epsilon)\gamma$.
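The following is a hedged sketch of why a bound of this form can hold; it is not the paper's proof. It assumes definitions that this excerpt names but does not spell out: $S$ is the candidate label set and always contains the true label $y$, $\mathbb{E}\left[{\rm CR}(\bm{f})\right]$ is the probability that the predicted label $\hat{y}(\bm{x})$ lies in $S$, and the ambiguity degree $\gamma$ upper-bounds the probability that any fixed incorrect label appears in $S$. The symbols $S$, $y$, and $\hat{y}$ are introduced here for illustration only.

```latex
\begin{align*}
\mathbb{E}\left[{\rm CR}(\bm{f})\right]
  &= \Pr\left(\hat{y}(\bm{x}) \in S\right) \\
  &= \Pr\left(\hat{y}(\bm{x}) = y\right)
     + \Pr\left(\hat{y}(\bm{x}) \neq y,\ \hat{y}(\bm{x}) \in S\right)
     && \text{(since } y \in S \text{ by assumption)} \\
  &\leq {\rm ACC}(\bm{f}) + \gamma \left(1 - {\rm ACC}(\bm{f})\right)
     && \text{(assumed definition of } \gamma \text{)} \\
  &\leq {\rm ACC}(\bm{f}) + (1 - \epsilon)\gamma
     && \text{(since } {\rm ACC}(\bm{f}) \geq \epsilon \text{)}.
\end{align*}
```

Under these assumptions, the gap between the expected covering rate and the accuracy shrinks as the classifier becomes more accurate or as the candidate sets become less ambiguous.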

Figures (4)

  • Figure 1: The two left panels show the difference between using an ordinary-label dataset for validation (lighter colors) and for training (darker colors) with a given algorithm. For validation (lighter colors), we searched for the best hyperparameter configuration of the algorithm using the validation set. For training (darker colors), we treated the validation examples as partial-label examples with a single candidate label and added them to the training set, using default hyperparameters without tuning. For fair comparison, we trained all models for the same number of iterations. The two right panels show the classification accuracies of some PLL algorithms on Soccer Player and Yahoo! News as reported in paper A (zhang20222exploiting) and paper B (xu2023progressive), respectively.
  • Figure 2: (a) The distribution of the collected partial labels of PLCIFAR10. (b) The noise rate as the number of annotators increases. (c) The flipping probability matrix computed on PLCIFAR10-Aggregate. (d) The flipping probability matrix computed on PLCIFAR10-Vaguest.
  • Figure 3: Experimental results of different algorithms on tabular datasets. The top, middle, and bottom figures correspond to box plots of experimental results using CR, AA, OA, and OA with ES for hyperparameter tuning, respectively. The colors of the bars indicate the mean accuracy.
  • Figure 4: Running time and GPU memory utilization for each running step of different PLL algorithms on PLCIFAR10-Vaguest with DenseNet.

Theorems & Definitions (11)

  • Definition 1: Covering Rate (CR)
  • Definition 2: Ambiguity Degree
  • Proposition 1
  • Theorem 1
  • Definition 3: Approximated Accuracy (AA)
  • Theorem 2
  • Definition 4: Oracle Accuracy (OA)
  • proof
  • proof
  • Lemma 1 (wu2023learning)
  • ...and 1 more
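
As an illustration of the covering rate (CR) listed in Definition 1 above, here is a minimal Python sketch. It assumes CR is the fraction of examples whose predicted label falls inside the candidate label set, which is how the name is usually read; this excerpt does not give the exact formula. The function names and the toy data are hypothetical and are not taken from PLENCH.

```python
def covering_rate(predictions, candidate_sets):
    """Fraction of examples whose predicted label lies in its candidate set.

    Assumed reading of CR; needs only partial labels, no true labels.
    """
    if len(predictions) != len(candidate_sets):
        raise ValueError("predictions and candidate_sets must have equal length")
    if not predictions:
        raise ValueError("at least one example is required")
    hits = sum(1 for pred, cands in zip(predictions, candidate_sets) if pred in cands)
    return hits / len(predictions)


def accuracy(predictions, true_labels):
    """Ordinary accuracy; needs true labels, which a PLL validation set lacks."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have equal length")
    if not predictions:
        raise ValueError("at least one example is required")
    correct = sum(1 for pred, true in zip(predictions, true_labels) if pred == true)
    return correct / len(predictions)


# Hypothetical toy data: every candidate set contains the true label.
true_labels = [0, 1, 2, 1]
candidate_sets = [{0, 1}, {1}, {0, 2}, {1, 2}]
predictions = [1, 1, 2, 0]

cr = covering_rate(predictions, candidate_sets)
acc = accuracy(predictions, true_labels)
print(f"CR = {cr:.2f}, accuracy = {acc:.2f}")
```

In this toy example the first prediction is wrong but still lands in its candidate set, so CR overestimates accuracy. That gap is the quantity Proposition 1 bounds in terms of the ambiguity degree.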