Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks

Anita Eisenbürger, Daniel Otten, Anselm Hudde, Frank Hopfgartner

TL;DR

This work studies the robustness of gradient-boosted decision trees (GBDTs) to label noise in tabular data, a domain where such studies are scarce. It develops four noise-detection methods, two adapted from deep learning (LRT-Correction, AUM Ranking) and two novel, including Gradients, together with two noise-correction strategies (removal and relabeling), and evaluates them in a systematic experimental framework. The key findings are that GBDTs are naturally robust to symmetric noise, that early stopping mitigates overfitting to noisy labels, and that detection methods such as AUM and LRT achieve state-of-the-art noise-detection accuracy on datasets such as Adult, although performance is dataset-dependent. The results highlight the importance of dataset characteristics and suggest dynamic thresholding (e.g., with Gaussian Mixture Models) to adapt noise handling without extensive tuning, laying groundwork for robust GBDT training under label noise in tabular domains.

Abstract

Label noise, which refers to the mislabeling of instances in a dataset, can significantly impair classifier performance, increase model complexity, and affect feature selection. While most research has concentrated on deep neural networks for image and text data, this study explores the impact of label noise on gradient-boosted decision trees (GBDTs), the leading algorithm for tabular data. This research fills a gap by examining the robustness of GBDTs to label noise, focusing on adapting two noise detection methods from deep learning for use with GBDTs and introducing a new detection method called Gradients. Additionally, we extend a method initially designed for GBDTs to incorporate relabeling. By using diverse datasets such as Covertype and Breast Cancer, we systematically introduce varying levels of label noise and evaluate the effectiveness of early stopping and noise detection methods in maintaining model performance. Our noise detection methods achieve state-of-the-art results, with a noise detection accuracy above 99% on the Adult dataset across all noise levels. This work enhances the understanding of label noise in GBDTs and provides a foundation for future research in noise detection and correction methods.

Paper Structure

This paper contains 27 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Noise transition matrices for a dataset with four classes under no noise, 20% symmetric noise, and 20% pair noise, respectively.
  • Figure 2: Maximum absolute gradients of noisy and clean instances per epoch (Covertype). Noisy instances exhibit significantly larger gradients.
  • Figure 3: The types of predictions the model makes during training at 10% noise (Dry Bean). Only instances where the true label deviates from the noisy label are shown. The model mostly predicts the ground-truth label in the beginning and gradually adapts to the noisy labels.
  • Figure 4: Classification accuracy on the train and test set per epoch at 30% noise without early stopping (Dry Bean). Accuracy on the clean test set decreases much more slowly when the model is trained on symmetric noise, implying that GBDTs are more robust to symmetric noise.
  • Figure 5: Training curves at 10% and 40% pair noise, respectively (Dry Bean). The difference in performance is smaller on the test set than on the train set, implying the model is also somewhat robust to pair noise.
  • ...and 7 more figures
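
The noise transition matrices described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the pair-noise convention used here (each class flips to the next class, cyclically) is an assumption, since the caption does not state which class each label is paired with. Entry T[i, j] is the probability that a true label i is observed as label j.

```python
import numpy as np


def symmetric_transition(num_classes, noise_rate):
    """A label flips with probability noise_rate, uniformly to any other class."""
    T = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T


def pair_transition(num_classes, noise_rate):
    """A label flips with probability noise_rate to one paired class.

    Assumption: the paired class is the next class index, wrapping around.
    """
    T = np.eye(num_classes) * (1.0 - noise_rate)
    for k in range(num_classes):
        T[k, (k + 1) % num_classes] = noise_rate
    return T


def apply_noise(labels, T, seed=0):
    """Sample a noisy label for each clean label from the matching row of T."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])


if __name__ == "__main__":
    print(symmetric_transition(4, 0.2))
    print(pair_transition(4, 0.2))
```

With four classes and a 20% noise rate, every row of either matrix keeps 0.8 on the diagonal. The symmetric matrix spreads the remaining 0.2 evenly over the three other classes, while the pair matrix puts all of it on a single off-diagonal entry.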