Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data

Sergei Popov, Stanislav Morozov, Artem Babenko

TL;DR

This work introduces Neural Oblivious Decision Ensembles (NODE), a differentiable, end-to-end architecture for deep learning on heterogeneous tabular data that extends CatBoost-style oblivious decision trees into multi-layer networks. Each NODE layer uses differentiable oblivious trees whose routing is learned with the entmax transformation, which enables sparse, feature-efficient splits. The layers are combined in a DenseNet-like multi-layer design that aggregates outputs across layers. The authors demonstrate that NODE frequently outperforms tuned gradient boosting methods (e.g., XGBoost, CatBoost) on a variety of tabular datasets, while maintaining competitive training and inference efficiency. The study contributes the architecture itself, ablation analyses of decision-function choices, and an open-source PyTorch implementation for use in tabular-data tasks.

Abstract

Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in the important case of heterogeneous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is insufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The single ODT inside the NODE layer. The splitting features and the splitting thresholds are shared across all the internal nodes of the same depth. The output is a sum of leaf responses scaled by the choice weights.
  • Figure 2: The NODE architecture, consisting of densely connected NODE layers. Each layer contains several trees whose outputs are concatenated and serve as input for the subsequent layer. The final prediction is obtained by averaging the outputs of all trees from all the layers.
  • Figure 3: NODE on the UCI Higgs dataset. Top left: individual feature importance distributions for both original and learned features. Bottom left: mean absolute contribution of individual trees to the final response. Right: dependence of responses on feature importances. See details in the text.
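
The captions of Figures 1 and 2 can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' PyTorch implementation: it uses plain Python, and it swaps in softmax and a sigmoid where the paper uses entmax-based choice functions, so the splits here are soft but not sparse. The names `soft_oblivious_tree` and `node_forward` and the `temperature` parameter are hypothetical. One tree shares a feature choice and a threshold per depth level and returns the sum of leaf responses scaled by the choice weights; the stacked version feeds each layer the original features plus all earlier tree outputs and averages every tree output.

```python
import math
from itertools import product


def softmax(scores):
    """Numerically stable softmax (stand-in for entmax; NOT sparse)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Soft two-way choice (stand-in for the paper's entmax-based choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_oblivious_tree(x, feature_logits, thresholds, leaf_responses,
                        temperature=1.0):
    """Forward pass of one soft oblivious decision tree.

    x               -- list of n input features
    feature_logits  -- one list of n logits per depth level
    thresholds      -- one threshold per depth level
    leaf_responses  -- 2**depth leaf values, first level = most significant bit
    """
    depth = len(thresholds)
    # One shared split per depth level: soft feature choice, then soft compare.
    go_right = []
    for logits, threshold in zip(feature_logits, thresholds):
        weights = softmax(logits)
        feature = sum(w * xi for w, xi in zip(weights, x))
        go_right.append(sigmoid((feature - threshold) / temperature))
    # Choice weight of a leaf = product of per-level left/right probabilities.
    output = 0.0
    for leaf_index, path in enumerate(product((0, 1), repeat=depth)):
        weight = 1.0
        for level, bit in enumerate(path):
            weight *= go_right[level] if bit else 1.0 - go_right[level]
        output += weight * leaf_responses[leaf_index]
    return output


def node_forward(x, layers):
    """Densely connected stack: each layer sees x plus all earlier outputs.

    layers -- list of layers; each layer is a list of
              (feature_logits, thresholds, leaf_responses) tuples.
    Returns the average over every tree output from every layer.
    """
    inputs = list(x)
    all_outputs = []
    for layer in layers:
        outputs = [soft_oblivious_tree(inputs, *tree) for tree in layer]
        all_outputs.extend(outputs)
        inputs = inputs + outputs  # dense connection to later layers
    return sum(all_outputs) / len(all_outputs)


if __name__ == "__main__":
    # Depth-2 tree on two features; sharp logits and a low temperature
    # make the soft tree behave almost like a hard oblivious tree.
    tree = ([[50.0, 0.0], [0.0, 50.0]], [0.0, 0.0], [10.0, 20.0, 30.0, 40.0])
    print(soft_oblivious_tree([2.0, -1.0], *tree, temperature=0.01))
    print(node_forward([2.0, -1.0], [[tree, tree]]))
```

Because the per-level probabilities multiply along each path, the choice weights over all leaves sum to one, so the tree output is a convex combination of its leaf responses. Note that trees in later layers need one logit per original feature plus one per earlier tree output.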