Realistic Evaluation of Deep Partial-Label Learning Algorithms

Wei Wang; Dong-Dong Wu; Jindong Wang; Gang Niu; Min-Ling Zhang; Masashi Sugiyama

TL;DR

This work tackles the reproducibility gap in deep partial-label learning (PLL) by introducing two resources: PLENCH, a standardized benchmark that includes novel model-selection criteria with theoretical guarantees, and PLCIFAR10, a realistic image dataset with human-annotated partial labels. It shows that selecting hyperparameters with different criteria (covering rate, CR; approximated accuracy, AA; or oracle accuracy, OA) can yield different outcomes, and that no single algorithm uniformly dominates across diverse real-world settings. The study demonstrates that simple, well-tuned methods can rival more complex, resource-intensive approaches, and emphasizes the need for realistic data and consistent evaluation protocols in PLL. Together, PLENCH and PLCIFAR10 offer a practical foundation for fair comparisons and progress toward robust PLL methods in real-world scenarios.

Abstract

Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels, only one of which is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that are compatible with modern network architectures. Based on these findings, we propose PLENCH, the first Partial-Label learning bENCHmark to systematically compare state-of-the-art deep PLL algorithms. We investigate the model selection problem for PLL for the first time, and propose novel model selection criteria with theoretical guarantees. We also create Partial-Label CIFAR-10 (PLCIFAR10), an image dataset of human-annotated partial labels collected from Amazon Mechanical Turk, to provide a testbed for evaluating the performance of PLL algorithms in more realistic scenarios. Based on PLENCH, researchers can quickly and conveniently perform a comprehensive and fair evaluation and verify the effectiveness of newly developed algorithms. We hope that PLENCH will facilitate standardized, fair, and practical evaluation of PLL algorithms in the future.
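
To make the problem setting concrete, the following is a minimal sketch of partial-label data and of why model selection is non-trivial when true labels are unavailable. It is not the PLENCH implementation. The helper names (`make_partial_labels`, `coverage`, `accuracy`), the synthetic generation scheme (each incorrect label joins the candidate set independently with probability `flip_prob`), and the coverage score itself are assumptions made for illustration only.

```python
import random


def make_partial_labels(true_labels, num_classes, flip_prob, seed=0):
    """Hypothetical generator: each candidate set holds the true label,
    plus every other label independently with probability flip_prob."""
    rng = random.Random(seed)
    candidates = []
    for y in true_labels:
        label_set = {y}
        for k in range(num_classes):
            if k != y and rng.random() < flip_prob:
                label_set.add(k)
        candidates.append(label_set)
    return candidates


def coverage(predictions, candidates):
    """Fraction of predictions that fall inside the candidate set.
    Computable from partial labels alone (no true labels needed)."""
    hits = sum(1 for p, s in zip(predictions, candidates) if p in s)
    return hits / len(predictions)


def accuracy(predictions, true_labels):
    """Ordinary accuracy; requires true labels, which PLL data lacks."""
    hits = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return hits / len(true_labels)


# Toy data: four examples, three classes.
true_labels = [0, 1, 2, 1]
candidates = make_partial_labels(true_labels, num_classes=3, flip_prob=0.5)
predictions = [0, 1, 0, 1]  # the third prediction is wrong

print("candidate sets:", candidates)
print("coverage:", coverage(predictions, candidates))
print("accuracy:", accuracy(predictions, true_labels))
```

Because the true label always lies inside its candidate set in this sketch, coverage can never fall below accuracy, but it can exceed it whenever a wrong prediction happens to land on a distractor label. This gap is one reason a validation score computed from partial labels only approximates the quantity one actually cares about.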
