CatBoost: gradient boosting with categorical features support
Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin
TL;DR
CatBoost is an open-source gradient boosting library that handles categorical features natively. It encodes categories with ordered target statistics, computed over random permutations of the data and smoothed by a prior, to reduce overfitting, and it adds a scheme for reducing the bias of gradient estimates. It also includes a fast scorer based on oblivious trees and a GPU training pipeline that uses histogram-based split search, perfect hashing for categorical features, and multi-GPU support. In the paper's experiments, CatBoost achieves better predictive quality than XGBoost, LightGBM, and H2O on a set of popular public datasets, along with substantial GPU training speedups and faster model scoring. These properties make it well suited to real-world tasks with heterogeneous data and many categorical features.
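To make the ordered target encoding idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the library's implementation: the function name `ordered_target_stats` and the parameters `prior`, `prior_weight`, and `order` are hypothetical, and the smoothing formula (sum of earlier targets plus weighted prior, divided by earlier count plus weight) is one common form of the statistic. Each example is encoded using only the targets of examples that precede it in a permutation, so its own label never leaks into its feature value.

```python
import random


def ordered_target_stats(categories, targets, prior, prior_weight=1.0,
                         order=None, seed=0):
    """Encode one categorical column with ordered target statistics.

    Hypothetical helper for illustration only; not CatBoost's API.
    `order` is the permutation in which examples are visited; if omitted,
    a random permutation is drawn from `seed`.
    """
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of examples seen so far, per category
    encoded = [0.0] * n

    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        # Use only earlier examples in the permutation, smoothed by the prior.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now add the current example to the running statistics.
        sums[cat] = s + targets[i]
        counts[cat] = k + 1

    return encoded


if __name__ == "__main__":
    cats = ["a", "b", "a", "a"]
    ys = [1, 0, 0, 1]
    print(ordered_target_stats(cats, ys, prior=0.5, order=[0, 1, 2, 3]))
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the category's observed mean as more preceding examples accumulate. Averaging over several permutations, as the summary above suggests, would reduce the variance of the encoding for examples that land early in any single ordering.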
Abstract
In this paper we present CatBoost, a new open-source gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, both of which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.
