LLM Meeting Decision Trees on Tabular Data

Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang

TL;DR

DeLTa introduces a data-privacy-friendly framework that integrates LLM reasoning with decision-tree rules for tabular prediction. By refining an ensemble of tree rules via an LLM-derived rule and applying a gradient-based residual correction, it achieves state-of-the-art results without requiring LLM fine-tuning or tabular data serialization. The approach is validated across diverse datasets, demonstrating strong performance in both full-data and few-shot settings, along with favorable computational efficiency. The work highlights a principled way to combine expert rule structures with large language model reasoning to improve tabular learning while preserving privacy. This has practical implications for healthcare, finance, and other domains where data privacy and sample efficiency are critical.

Abstract

Tabular data play a vital role in diverse real-world fields, including healthcare and finance. Following the recent success of Large Language Models (LLMs), early work has begun extending LLMs to the domain of tabular data. Most of these LLM-based methods first serialize tabular data into natural language descriptions, and then either tune LLMs or run inference directly on the serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure; and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and the scalability of in-context learning is bottlenecked by input length constraints (so it is suited to few-shot learning). This work explores a novel direction of integrating LLMs into tabular data through logical decision tree rules as intermediaries, and proposes DeLTa, a decision tree enhancer with LLM-derived rule for tabular prediction. DeLTa avoids tabular data serialization and can be applied to the full-data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for the original decision trees via the new rule generated by the LLM, which approximates the error correction vector to steer the original decision tree predictions in the direction that reduces "errors". Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.
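
The calibration idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic data, the one-split stumps standing in for decision tree experts, and the hand-written `refined_rule` standing in for an LLM-derived rule are all assumptions made for illustration. It only shows the general pattern of averaging tree experts into $F(x)$ and then shifting $F(x)$ by a per-leaf correction estimated from residuals, which are the negative gradient of the squared loss.

```python
import random

rng = random.Random(0)

# Synthetic regression data (assumption, for illustration only).
X = [(rng.random(), rng.random()) for _ in range(200)]
y = [2.0 * a + (1.0 if b > 0.5 else 0.0) + rng.gauss(0.0, 0.1) for a, b in X]


def mean(values):
    return sum(values) / len(values)


def fit_stump(X_sub, y_sub, threshold):
    """One-split 'rule' on feature 0: predict the mean label on each side."""
    left = [t for x, t in zip(X_sub, y_sub) if x[0] <= threshold]
    right = [t for x, t in zip(X_sub, y_sub) if x[0] > threshold]
    fallback = mean(y_sub)
    left_value = mean(left) if left else fallback
    right_value = mean(right) if right else fallback
    return lambda x: left_value if x[0] <= threshold else right_value


# K = 3 tree experts, each fit on its own bootstrap sample with its own rule.
experts = []
for threshold in (0.3, 0.5, 0.7):
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    experts.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], threshold))


def F(x):
    """Ensemble prediction: the average of the tree experts."""
    return mean([f(x) for f in experts])


def refined_rule(x):
    """Hypothetical stand-in for an LLM-derived rule; returns a leaf id in 0..3."""
    return 2 * int(x[0] > 0.5) + int(x[1] > 0.5)


# Residuals y - F(x) are the negative gradient of 0.5 * (y - F(x))^2 w.r.t. F(x).
residuals = [t - F(x) for x, t in zip(X, y)]

# Estimate one correction value per leaf of the refined rule (mean residual).
by_leaf = {}
for x, r in zip(X, residuals):
    by_leaf.setdefault(refined_rule(x), []).append(r)
delta = {leaf: mean(rs) for leaf, rs in by_leaf.items()}


def corrected(x):
    """Ensemble prediction shifted by the correction for x's leaf."""
    return F(x) + delta.get(refined_rule(x), 0.0)


mse_before = mean([(t - F(x)) ** 2 for x, t in zip(X, y)])
mse_after = mean([(t - corrected(x)) ** 2 for x, t in zip(X, y)])
print(f"train MSE before correction: {mse_before:.4f}")
print(f"train MSE after correction:  {mse_after:.4f}")
```

In this sketch the rule function only sees the structure of the partition, not a text serialization of the rows, and the correction is fit on top of the frozen experts. Because the mean residual is the best constant shift within each leaf, the training error after correction cannot exceed the error before it; behavior on held-out data is not shown here.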

Paper Structure

This paper contains 28 sections, 2 theorems, 7 equations, 10 figures, 20 tables, 1 algorithm.

Key Result

Proposition 1

Let $\mathbb{E}\left[\mathcal{L}(F(x), y)\right]$ denote the expected loss, where $F(x) = \frac{1}{K}\sum_{k=1}^{K} f_k(x \mid \mathcal{D}_{train}^k, r_k)$ and each $f_k$ corresponds to a decision tree rule $r_k$ from the expert-derived rule set $\mathcal{R} = \{r_k\}_{k=1}^K$. Given a prompt $p$ that contains

Figures (10)

  • Figure 1: The DeLTa framework. As shown in the main objective, we calibrate the output of the original decision tree experts $F(x)$ in the direction that reduces "errors". Subfigure (a) describes the process of refining decision tree rules with an LLM, and subfigure (b) details the refined rule-guided error correction for decision trees.
  • Figure 2: Average intra-node distance comparison.
  • Figure 3: Test NRMSE ($\downarrow$) performance of DeLTa and LLM-based baseline methods on regression tasks.
  • Figure 4: Test performance of DeLTa and non-LLM baseline methods on classification and regression tasks. Here, "RF" denotes Random Forest, "FT-T" denotes FT-Transformer, "MNCA" denotes ModernNCA.
  • Figure 5: Visualization of label prediction of DeLTa w/ and w/o error correction vector $\Delta_x$ on BA dataset.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2