Register Your Forests: Decision Tree Ensemble Optimization by Explicit CPU Register Allocation

Daniel Biebert, Christian Hakert, Kuan-Hsun Chen, Jian-Jia Chen

TL;DR

The paper tackles inefficiencies from general-purpose toolchains in deploying decision-tree ensembles on resource-constrained systems by proposing direct generation of architecture-specific assembly with explicit CPU register allocation. It analyzes native, if-else, and hybrid tree realizations, and introduces strategies for storing feature values and tree nodes in registers to maximize reuse. Empirical results show significant, architecture-sensitive speedups (up to around 1.6–1.66×) when register budgets are matched to the hardware, with guidance that native trees benefit from larger register pools (≈20) and if-else trees from smaller ones (≈10). The findings illustrate that register-aware code generation can materially improve inference runtime, while underscoring the need to tailor the approach to the target CPU and ensemble configuration; future work includes scheduling across ensembles to further expand the optimization space.

Abstract

Bringing high-level machine learning models to efficient and well-suited machine implementations often involves a chain of tools, e.g., code generators, compilers, and optimizers. Along such tool chains, abstractions have to be applied, which leads to suboptimal use of CPU registers. This is a shortcoming, especially in resource-constrained embedded setups. In this work, we present a code generation approach for decision tree ensembles that produces machine assembly code in a single conversion step directly from the high-level model representation. Specifically, we develop various approaches to effectively allocate registers for the inference of decision tree ensembles. We extensively evaluate the proposed method against the baseline of generating C code from the high-level machine learning model and subsequently compiling it. The results show that the performance of decision tree ensemble inference can be significantly improved (by up to $\approx 1.6\times$) if the methods are applied carefully to the appropriate scenario.
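
To make the "native" and "if-else" tree realizations mentioned in the TL;DR more concrete, the following is a minimal sketch, not the authors' generator. It assumes the common meaning of the two terms: a native tree is a node array traversed in a loop, while an if-else tree unrolls the same tree into nested branches. The node layout, the toy tree `NODES`, and the function names `predict_native`, `emit_if_else`, and `emit_c_function` are hypothetical. The sketch emits C text for readability, whereas the paper generates assembly directly and additionally decides which feature values and nodes to keep in registers, which is not modeled here.

```python
# Hypothetical toy tree; the tuple layout is an illustrative assumption.
# Inner node: (feature_index, threshold, left_child, right_child)
# Leaf:       (-1, predicted_class, -1, -1)
NODES = [
    (0, 0.5, 1, 2),    # node 0: x[0] <= 0.5 ?
    (-1, 0, -1, -1),   # node 1: leaf, class 0
    (1, 2.0, 3, 4),    # node 2: x[1] <= 2.0 ?
    (-1, 1, -1, -1),   # node 3: leaf, class 1
    (-1, 2, -1, -1),   # node 4: leaf, class 2
]


def predict_native(nodes, x):
    """Native realization: walk the node array in a loop until a leaf is hit."""
    i = 0
    while nodes[i][0] != -1:
        feature, threshold, left, right = nodes[i]
        i = left if x[feature] <= threshold else right
    return nodes[i][1]


def emit_if_else(nodes, i=0, indent=1):
    """If-else realization: emit one nested C-style branch per inner node."""
    pad = "    " * indent
    feature, value, left, right = nodes[i]
    if feature == -1:
        return f"{pad}return {value};\n"
    return (
        f"{pad}if (x[{feature}] <= {value}f) {{\n"
        + emit_if_else(nodes, left, indent + 1)
        + f"{pad}}} else {{\n"
        + emit_if_else(nodes, right, indent + 1)
        + f"{pad}}}\n"
    )


def emit_c_function(nodes, name="predict_tree"):
    """Wrap the nested branches in a C function signature."""
    return f"int {name}(const float *x) {{\n" + emit_if_else(nodes) + "}\n"
```

Calling `predict_native(NODES, [0.9, 1.5])` walks the array at run time, while `print(emit_c_function(NODES))` shows the same decisions baked into control flow. In the native form the thresholds live in data, and in the if-else form they become constants in the code, which gives one intuition for why the two forms could favor different register budgets.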

Paper Structure (18 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Native tree methods on x86 for server class (top) and desktop class (bottom), 100 trees with maximum depth 15
  • Figure 2: If-else tree methods on x86 for server class (top) and desktop class (bottom), 25 trees with maximum depth 5