Gradient Boosting Reinforcement Learning

Benjamin Fuhrer, Chen Tessler, Gal Dalal

TL;DR

GBRL tackles the challenge of applying gradient boosting trees to reinforcement learning by treating the tree ensemble as a function-parameterization of the policy and value functions and updating it via functional gradients. The approach interleaves tree construction with environment interaction, enabling incremental learning suitable for RL’s non-stationary data. Key contributions include a gradient-based GBRL framework with a shared actor-critic architecture, a CUDA-accelerated implementation, and empirical evidence that GBRL outperforms neural networks on structured observation tasks while offering improved robustness to out-of-distribution signals and spurious correlations. The work demonstrates that GBTs, with their strength on structured data, can serve as competitive or superior function approximators in RL domains where such features are prevalent, while also outlining limitations and directions for future work. Overall, GBRL broadens the RL toolbox by bringing gradient-boosted ensembles into online, interactive learning contexts with practical hardware acceleration and integration capabilities.

Abstract

We present Gradient Boosting Reinforcement Learning (GBRL), a framework that adapts the strengths of gradient boosting trees (GBT) to reinforcement learning (RL) tasks. While neural networks (NNs) have become the de facto choice for RL, they face significant challenges with structured and categorical features and tend to generalize poorly to out-of-distribution samples. These are areas in which GBTs have traditionally excelled in supervised learning. However, the application of GBTs in RL has been limited. Traditional GBT libraries are optimized for static datasets with fixed labels, making them incompatible with RL's dynamic nature, where both state distributions and reward signals evolve during training. GBRL overcomes this limitation by continuously interleaving tree construction with environment interaction. Through extensive experiments, we demonstrate that GBRL outperforms NNs in domains with structured observations and categorical features while maintaining competitive performance on standard continuous control benchmarks. Like its supervised learning counterpart, GBRL demonstrates superior robustness to out-of-distribution samples and better handles irregular state-action relationships.

Paper Structure (45 sections, 6 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: The GBRL framework. The actor's policy and critic's value function are parameterized by the tree ensemble $F_k$. For example, $F_k(\mathrm{s}) = [\mu(\mathrm{s}), \sigma(\mathrm{s}), V(\mathrm{s})]$ for a Gaussian policy. Starting from $F_0$, at each training iteration, $k$, GBRL collects a rollout and computes the gradient $\nabla_{\pi_{F_k}}J(\pi_{F_k})$ with respect to the current ensemble. This gradient is then used to fit the next tree. Adding the tree to the ensemble updates it to $F_k(\mathrm{s}) = F_{k-1}(\mathrm{s}) + \epsilon h_k(\mathrm{s})$, where $\epsilon$ is the learning rate.
  • Figure 2: GBT library comparison, Cartpole. CatBoost and XGBoost are intractable in RL. CatBoost's lack of GPU support for custom losses leads to low FPS and early termination.
  • Figure 3: Shared Actor-Critic, MiniGrid. The shared tree structure significantly increases efficiency, without impacting the score. Aggregated results over three tasks are shown here; full per-task curves are available in the Appendix.
  • Figure 4: GBRL vs NN in standard environments (PPO). Aggregated mean and standard deviation of the normalized average reward for the final 100 episodes.
  • Figure 5: Signal dilution, variable isolation task. Mean and standard deviation of the average episodic reward during training. GBRL was trained for 15M training steps and NN for 30M. Episodes are terminated after 50 steps if not solved.
  • ...and 15 more figures
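
The update described in the Figure 1 caption, $F_k(\mathrm{s}) = F_{k-1}(\mathrm{s}) + \epsilon h_k(\mathrm{s})$, can be illustrated with a small sketch. This is not the GBRL library or its API: the names `fit_stump`, `predict_stump`, `Ensemble`, and `target` are hypothetical, the weak learner is a depth-1 tree rather than the trees used in the paper, and a squared-error loss on a synthetic target stands in for the actor-critic objective. Because the sketch minimizes a loss instead of maximizing $J$, each tree is fit to the negative gradient. What it does show is the core loop: a fresh batch at every iteration, a per-sample gradient with respect to the ensemble outputs, and one shared tree with vector-valued leaves added per step.

```python
import numpy as np


def fit_stump(X, G):
    """Fit a depth-1 tree with vector-valued leaves to targets G of shape (n, d_out)."""
    best = None
    for j in range(X.shape[1]):
        # Candidate thresholds; dropping the max keeps the right leaf non-empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lv, rv = G[left].mean(axis=0), G[~left].mean(axis=0)
            err = ((G[left] - lv) ** 2).sum() + ((G[~left] - rv) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:  # every feature is constant: fall back to a single leaf
        mean = G.mean(axis=0)
        return (0, np.inf, mean, mean)
    return best[1:]


def predict_stump(stump, X):
    j, t, lv, rv = stump
    # Broadcast the (n, 1) split mask against the (d_out,) leaf vectors.
    return np.where((X[:, j] <= t)[:, None], lv, rv)


class Ensemble:
    """Additive ensemble F_k(s) = F_{k-1}(s) + lr * h_k(s), with F_0 = 0."""

    def __init__(self, d_out, lr):
        self.d_out = d_out
        self.lr = lr
        self.trees = []

    def predict(self, X):
        out = np.zeros((len(X), self.d_out))
        for tree in self.trees:
            out += self.lr * predict_stump(tree, X)
        return out

    def step(self, X, grad):
        # Fit the next tree to the descent direction (negative loss gradient).
        self.trees.append(fit_stump(X, -grad))


def target(X):
    # Synthetic three-output target, loosely mimicking [mu, sigma, V] (an assumption).
    return np.stack(
        [
            np.where(X[:, 0] > 0, 1.0, -1.0),
            np.where(X[:, 1] > 0, 0.5, 0.1),
            X[:, 0] + X[:, 1],
        ],
        axis=1,
    )


def loss(ens, X):
    return 0.5 * ((ens.predict(X) - target(X)) ** 2).sum(axis=1).mean()


rng = np.random.default_rng(0)
ens = Ensemble(d_out=3, lr=0.5)
X_eval = rng.uniform(-1, 1, size=(256, 2))
loss_before = loss(ens, X_eval)

for k in range(50):
    X = rng.uniform(-1, 1, size=(128, 2))  # fresh batch, standing in for a rollout
    grad = ens.predict(X) - target(X)      # d(loss)/d(F(s)) for squared error
    ens.step(X, grad)

loss_after = loss(ens, X_eval)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Using one tree with a leaf vector per split, rather than a separate ensemble per output, is the sketch's simplified analogue of the shared actor-critic structure mentioned in the Figure 3 caption.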