
Open set label noise learning with robust sample selection and margin-guided module

Yuandi Zhao, Qianxi Xia, Yang Sun, Zhijie Wen, Liyan Ma, Shihui Ying

TL;DR

This work addresses open-set label noise in deep learning for computer vision by proposing RSS-MGM, a framework that combines robust sample selection with a margin-guided module to separate clean data, in-distribution (ID) noisy samples, and out-of-distribution (OOD) samples. It enlarges the clean subset via small-loss and high-confidence criteria, uses margin-based rules to identify OOD samples, and reuses high-confidence ID samples through semi-supervised strategies. The method employs a tailored loss design with label smoothing for clean data, sharpening for high-confidence ID samples, and a consistency regularization term that stabilizes predictions across augmentations. It achieves state-of-the-art results on CIFAR100N-C, CIFAR80N-O, WebFG-496, and Food101N, particularly in open-set label noise learning (OLNL) settings. The approach provides a practical, scalable solution for real-world noisy data, improving discrimination between open-set and closed-set noise and enhancing model robustness.
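
The three loss ingredients named above (label smoothing, sharpening, and consistency regularization) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the smoothing factor `eps`, the `temperature`, and the squared-distance form of the consistency term are all hypothetical choices, not the formulation used by RSS-MGM.

```python
import math

def smooth_label(label, num_classes, eps=0.1):
    """Label smoothing: move eps of the target mass onto a uniform prior.
    eps is an assumed value, not one taken from the paper."""
    return [(1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
            for k in range(num_classes)]

def sharpen(probs, temperature=0.5):
    """Temperature sharpening: raise each probability to 1/T and renormalize.
    With T < 1 the distribution becomes more peaked."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def cross_entropy(target, probs, tiny=1e-12):
    """Cross-entropy between a (soft) target and a predicted distribution."""
    return -sum(t * math.log(p + tiny) for t, p in zip(target, probs))

def consistency(p_weak, p_strong):
    """Assumed consistency term: squared L2 distance between the predictions
    on the weakly and strongly augmented views."""
    return sum((a - b) ** 2 for a, b in zip(p_weak, p_strong))

# Usage: a clean sample gets a smoothed target, a re-labeled ID sample gets
# a sharpened pseudo-target, and both views are encouraged to agree.
p_w, p_s = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
clean_loss = cross_entropy(smooth_label(0, 3), p_w)
id_loss = cross_entropy(sharpen(p_w), p_s)
reg = consistency(p_w, p_s)
```

How these terms are weighted and combined is left open here, since this section does not specify it.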

Abstract

In recent years, the remarkable success of deep neural networks (DNNs) in computer vision has been largely due to large-scale, high-quality labeled datasets. Training directly on real-world datasets with label noise may result in overfitting. Traditional methods are limited to closed-set label noise, where the true class of each noisy training sample lies within the known label space. However, some real-world datasets contain open-set label noise, meaning that some samples belong to an unknown class outside the known label space. To address the open-set label noise problem, we introduce a method based on Robust Sample Selection and a Margin-Guided Module (RSS-MGM). First, unlike prior clean-sample selection approaches, which select only a limited number of clean samples, a robust sample selection module combines small-loss selection with high-confidence sample selection to obtain more clean samples. Second, to efficiently distinguish open-set label noise from closed-set label noise, margin functions are designed to separate open-set data from closed-set data. Third, different processing methods are applied to different types of samples in order to fully utilize the data's prior information and optimize the whole model. Extensive experiments on noisily labeled benchmark and real-world datasets, such as CIFAR100N-C, CIFAR80N-O, WebFG-496, and Food101N, indicate that our approach outperforms many state-of-the-art label noise learning methods. In particular, it separates open-set label noise samples from closed-set ones more accurately.
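
One way to picture the first step is a selection rule that keeps a sample if it passes either the small-loss test or the high-confidence test. The sketch below is a rough illustration only: `select_clean`, `keep_ratio`, `conf_threshold`, and the union rule itself are assumptions made for this example, and the paper's actual criteria may differ.

```python
import math

def select_clean(probs, labels, keep_ratio=0.5, conf_threshold=0.9):
    """Return sorted indices of samples treated as clean.

    probs  : list of per-sample class-probability lists
    labels : list of (possibly noisy) integer labels
    Both thresholds are hypothetical values chosen for illustration.
    """
    # Small-loss criterion: keep the keep_ratio fraction with the lowest
    # cross-entropy against the given label.
    losses = [-math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)]
    n_keep = int(keep_ratio * len(labels))
    order = sorted(range(len(labels)), key=lambda i: losses[i])
    small_loss = set(order[:n_keep])

    # High-confidence criterion: the prediction agrees with the given label
    # and its confidence reaches the threshold.
    high_conf = {
        i for i, (p, y) in enumerate(zip(probs, labels))
        if max(p) >= conf_threshold and p.index(max(p)) == y
    }

    # Assumed combination rule: union of the two selections.
    return sorted(small_loss | high_conf)

# Usage: with a strict keep_ratio only sample 0 passes the small-loss test,
# but sample 1 is also kept because the model confidently agrees with its label.
probs = [[0.95, 0.05], [0.92, 0.08], [0.50, 0.50], [0.10, 0.90]]
labels = [0, 0, 0, 0]
clean_idx = select_clean(probs, labels, keep_ratio=0.25)
```

Under this rule, the high-confidence test enlarges the clean set beyond what a fixed small-loss budget would admit, which is the effect the abstract describes.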

Paper Structure (26 sections, 26 equations, 3 figures, 6 tables, 1 algorithm)

This paper contains 26 sections, 26 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: An example of the open-set label noise problem. Consider a dataset collected from the web that uses the CIFAR-10 labels but contains different images. Each image in the dataset falls into one of three groups: Clean Set, Closed Set, or Open Set. Clean Set refers to images with correct labels. Closed Set refers to images that are labeled incorrectly but whose correct labels still lie within the known label space. Open Set refers to images that are labeled incorrectly and whose ground-truth labels lie outside the label space.
  • Figure 2: The overall framework of RSS-MGM. Each input image $x_i$ is augmented into one weakly augmented view and one strongly augmented view before being fed into the label predictor network, yielding two label predictions: $p^w$ for the weakly augmented view and $p^s$ for the strongly augmented view. Afterward, based on the Robust Sample Selection Module, samples are classified as Clean Set or Noisy Set. If the sample is clean, it will be fed into the label prediction network. Otherwise, based on the Margin-Guided Module, samples are divided into ID Set or OOD Set. Samples from the ID Set will be re-labeled to update the network, while samples from the OOD Set will be directly discarded. Finally, the model is updated by back-propagation.
  • Figure 3: An example of the small-loss selection method for dividing the training dataset. Here, the Clean Set contains the samples identified as clean, which are considered to be accurately labeled. The Noise Set, on the other hand, contains the samples identified as having noisy labels. However, the samples bordered in red in the Noise Set are actually clean and have been incorrectly assigned to the Noise Set.
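
The routing described in the Figure 2 caption (clean samples are used as-is, noisy samples are split into ID and OOD, ID samples are re-labeled, OOD samples are discarded) can be sketched as a small decision function. The margin used here, the gap between the two largest probabilities of the weak-view prediction, is a hypothetical stand-in: this section does not give the paper's margin functions, and `route_sample` and `margin_threshold` are names invented for the example.

```python
def route_sample(p_weak, is_clean, margin_threshold=0.5):
    """Decide how one sample is used in the current training step.

    p_weak   : class probabilities predicted on the weakly augmented view
    is_clean : result of the (separate) sample selection step
    Returns (group, new_label), where new_label is set only for ID samples.
    """
    if is_clean:
        # Clean samples keep their given label.
        return ("clean", None)

    # Assumed margin: top-1 probability minus top-2 probability.
    ranked = sorted(range(len(p_weak)), key=lambda k: p_weak[k], reverse=True)
    margin = p_weak[ranked[0]] - p_weak[ranked[1]]

    if margin >= margin_threshold:
        # Confident noisy sample: treat as in-distribution and re-label it
        # with the predicted class.
        return ("id", ranked[0])

    # Ambiguous noisy sample: treat as out-of-distribution and discard it.
    return ("ood", None)

# Usage: a peaked prediction is re-labeled, a flat one is dropped.
peaked = route_sample([0.8, 0.1, 0.1], is_clean=False)
flat = route_sample([0.4, 0.35, 0.25], is_clean=False)
```

The intuition behind this stand-in is that a sample from an unknown class tends to produce a flat prediction over the known classes, while a mislabeled in-distribution sample can still be predicted confidently.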