Decision Tree Induction Through LLMs via Semantically-Aware Evolution

Tennison Liu, Nicolas Huynh, Mihaela van der Schaar

TL;DR

This work tackles the NP-hard problem of decision tree induction by introducing LLEGO, a semantically aware genetic programming (GP) framework that uses Large Language Models to infuse semantic priors into genetic operators. It adds two operators, fitness-guided crossover and diversity-guided mutation, controlled by hyperparameters $\alpha$ and $\tau$ respectively, to balance exploitation and exploration and to search larger tree spaces more efficiently. Empirically, LLEGO evolves trees with superior generalization across diverse tabular benchmarks and achieves better search efficiency than traditional GP methods, often with fewer evaluations. The approach highlights the potential of combining LLM-driven semantics with evolutionary search to improve interpretability and performance in structured prediction tasks, while acknowledging higher computational costs.

Abstract

Decision trees are a crucial class of models offering robust predictive performance and inherent interpretability across various domains, including healthcare, finance, and logistics. However, current tree induction methods often face limitations such as suboptimal solutions from greedy methods or prohibitive computational costs and limited applicability of exact optimization approaches. To address these challenges, we propose an evolutionary optimization method for decision tree induction based on genetic programming (GP). Our key innovation is the integration of semantic priors and domain-specific knowledge about the search space into the optimization algorithm. To this end, we introduce $\texttt{LLEGO}$, a framework that incorporates semantic priors into genetic search operators through the use of Large Language Models (LLMs), thereby enhancing search efficiency and targeting regions of the search space that yield decision trees with superior generalization performance. This is operationalized through novel genetic operators that work with structured natural language prompts, effectively utilizing LLMs as conditional generative models and sources of semantic knowledge. Specifically, we introduce $\textit{fitness-guided}$ crossover to exploit high-performing regions, and $\textit{diversity-guided}$ mutation for efficient global exploration of the search space. These operators are controlled by corresponding hyperparameters that enable a more nuanced balance between exploration and exploitation across the search space. Empirically, we demonstrate across various benchmarks that $\texttt{LLEGO}$ evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches.

Paper Structure

This paper contains 42 sections, 5 equations, 24 figures, 15 tables.

Figures (24)

  • Figure 1: $\texttt{LLEGO}$ Overview. In each generation $g \in [G]$, a population of trees $\mathcal{P}^{(g)}$ is evolved through crossover $\mathcal{O}_{\textrm{XO}}=\texttt{LLEGO}_{\textrm{XO}}(\mathcal{P}^{(g)}, \mathcal{C}; \alpha)$ and mutation $\mathcal{O}_{\textrm{MUT}}=\texttt{LLEGO}_{\textrm{MUT}}(\mathcal{P}^{(g)}, \mathcal{C}; \tau)$. Subsequently, the offspring $\mathcal{O}_{\textrm{XO}} \cup \mathcal{O}_{\textrm{MUT}}$ are evaluated for fitness on $\mathcal{D}_\text{train}$, and selection preserves the top-$N$ trees, $\mathcal{P}^{(g+1)}\leftarrow\texttt{SEL}(\tilde{\mathcal{P}}^{(g+1)}, N)$, where $\tilde{\mathcal{P}}^{(g+1)}=\mathcal{P}^{(g)}\cup\mathcal{O}_{\textrm{XO}}\cup\mathcal{O}_{\textrm{MUT}}$.
  • Figure 2: $\texttt{LLEGO}_{\textrm{XO}}$. In each operation, the crossover operator (1) samples a set of parents $\mathcal{S}$ weighted by their fitness, (2) computes the desired fitness $f^*$ using $\mathcal{S}$ and $\alpha$, and (3) samples offspring via the LLM.
  • Figure 2: Performance on regression tasks. MSE ($\downarrow$) across $5$ regression datasets, best results emboldened.
  • Figure 3: $\texttt{LLEGO}_{\textrm{MUT}}$. In each operation, the mutation operator (1) samples a set of $\lambda'$ candidate offspring $\tilde{\mathcal{O}}$, (2) computes sampling weights $\theta$, inversely proportional to logprobs, with temperature $\tau$ controlling diversity, and (3) samples offspring via this weighted distribution $\texttt{Cat}_{\lambda'}(\theta)$.
  • Figure 4: Search efficiency. Median fitness and diversity across $25$ generations.
  • ...and 19 more figures
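
The generational loop in the Figure 1 caption and the weighted sampling step in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The captions do not give the exact form of the weights $\theta$, so the softmax over negated log-probabilities divided by $\tau$ used here is an assumption. The names `diversity_weights`, `sample_offspring`, `select_top_n`, and `evolve_one_generation` are hypothetical, trees are stand-in strings, and the LLM-backed operators are replaced by plain callables.

```python
import math
import random
from typing import Callable, List, Sequence

Tree = str  # stand-in for a decision tree representation (assumption)


def diversity_weights(logprobs: Sequence[float], tau: float) -> List[float]:
    """Assumed form of theta: softmax over -logprob / tau.

    Candidates with lower log-probability receive larger weight. In this
    sketch a small tau concentrates weight on the least likely candidates,
    while a large tau approaches a uniform distribution.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scores = [-lp / tau for lp in logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_offspring(
    candidates: Sequence[Tree],
    logprobs: Sequence[float],
    tau: float,
    k: int,
    rng: random.Random,
) -> List[Tree]:
    """Draw k offspring (with replacement) from the weighted categorical."""
    weights = diversity_weights(logprobs, tau)
    return rng.choices(list(candidates), weights=weights, k=k)


def select_top_n(
    pool: Sequence[Tree], fitness: Callable[[Tree], float], n: int
) -> List[Tree]:
    """Keep the n fittest trees (higher fitness is better)."""
    return sorted(pool, key=fitness, reverse=True)[:n]


def evolve_one_generation(
    population: Sequence[Tree],
    crossover: Callable[[Sequence[Tree]], List[Tree]],
    mutate: Callable[[Sequence[Tree]], List[Tree]],
    fitness: Callable[[Tree], float],
    n: int,
) -> List[Tree]:
    """One generation: pool parents with both offspring sets, keep top n."""
    offspring_xo = crossover(population)
    offspring_mut = mutate(population)
    pool = list(population) + offspring_xo + offspring_mut
    return select_top_n(pool, fitness, n)
```

In practice `crossover` and `mutate` would wrap LLM calls conditioned on the population and task context, and `fitness` would score each tree on the training data. The sketch fixes only the control flow around them.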