CatBoost: gradient boosting with categorical features support

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin

TL;DR

CatBoost is an open-source gradient boosting framework that handles categorical features natively. It combines ordered target encoding with permutation-based priors to reduce overfitting, and adds a gradient-bias mitigation strategy aimed at unbiased gradient estimates. It also provides a fast scorer based on oblivious trees and a GPU-optimized training pipeline that uses histogram-based splits, perfect hashing for categorical features, and multi-GPU support. Empirical results show better predictive quality than XGBoost, LightGBM, and H2O on a set of popular public datasets, along with substantial GPU speedups and faster model scoring. Together, these contributions offer practical advantages for real-world tasks with heterogeneous data and many categorical features.
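To make the ordered target encoding idea concrete, here is a minimal sketch in plain Python, not CatBoost's actual implementation. It assumes a single random permutation, a scalar prior, and a smoothing weight; the function name `ordered_target_stats` and its parameters are hypothetical and chosen for illustration only. Each row is encoded using only the targets of rows that precede it in the permutation, so a row's own label never leaks into its feature value.

```python
import random


def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Illustrative ordered target statistic (hypothetical helper, not a CatBoost API).

    For each row, returns (sum of earlier targets with the same category
    + prior_weight * prior) / (count of earlier rows with that category
    + prior_weight), where "earlier" refers to a random permutation of rows.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # assumed: one random permutation of the rows

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * n

    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Encode using only rows that came before this one in the permutation.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now add this row's own target to the running statistics.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1

    return encoded


if __name__ == "__main__":
    cats = ["red", "blue", "red", "red", "blue"]
    ys = [1, 0, 1, 0, 1]
    print(ordered_target_stats(cats, ys, prior=0.5))
```

In this sketch, the first row of each category in the permutation falls back to the prior, and later rows move toward the category's observed target mean as more preceding rows accumulate.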

Abstract

In this paper we present CatBoost, a new open-source gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: GPU learning curves