LLM Meeting Decision Trees on Tabular Data
Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang
TL;DR
DeLTa introduces a privacy-friendly framework that integrates LLM reasoning with decision-tree rules for tabular prediction. By refining an ensemble of tree rules via an LLM-derived rule and applying a gradient-based residual correction, it achieves state-of-the-art results without requiring LLM fine-tuning or tabular data serialization. The approach is validated across diverse datasets, demonstrating strong performance in both full-data and few-shot settings, along with favorable computational efficiency. The work highlights a principled way to combine expert rule structures with large language model reasoning to improve tabular learning while preserving privacy. This has practical implications for healthcare, finance, and other domains where data privacy and sample efficiency are critical.
Abstract
Tabular data play a vital role in diverse real-world fields, including healthcare and finance. With the recent success of Large Language Models (LLMs), early work has begun extending LLMs to tabular data. Most of these LLM-based methods first serialize tabular data into natural language descriptions, and then either fine-tune LLMs or run inference directly on the serialized data. However, these methods suffer from two inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure; and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and the scalability of in-context learning is bottlenecked by input length constraints, which makes it suitable mainly for few-shot learning. This work explores a novel direction that integrates LLMs into tabular data through logical decision tree rules as intermediaries, and proposes DeLTa, a decision tree enhancer with LLM-derived rule for tabular prediction. DeLTa avoids tabular data serialization and can be applied to the full-data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for the original decision trees via the new rule generated by the LLM, which approximates an error correction vector to steer the original decision tree predictions in the direction that reduces errors. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.
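
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scikit-learn, synthetic data, and a squared-error reading of "error correction", where the mean residual in each region of the refined rule stands in for the correction vector. The function `refined_rule` is a hypothetical, hand-written placeholder for what the LLM would return, and no LLM is actually called. The point it shows is that only rule text (feature names and thresholds) reaches the prompt, never the data rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data (assumption: the paper's benchmarks are not reproduced here).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Step 1: fit a small tree ensemble and export its rules as text.
forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
forest.fit(X_train, y_train)
feature_names = [f"f{i}" for i in range(X.shape[1])]
rules_text = "\n\n".join(
    export_text(tree, feature_names=feature_names) for tree in forest.estimators_
)

# Step 2: the prompt contains rules only, so no table rows are serialized.
# The prompt wording is illustrative, not taken from the paper.
prompt = "Given these decision tree rules, propose one improved rule:\n" + rules_text


def refined_rule(X):
    """Hypothetical placeholder for the LLM-derived rule.

    Maps each sample to one of four regions using two fixed thresholds.
    """
    return (X[:, 0] > 0.0).astype(int) * 2 + (X[:, 1] > 0.0).astype(int)


def fit_correction(X, y, base_prob, n_regions=4):
    """Mean residual per region: the averaged negative gradient of squared loss."""
    regions = refined_rule(X)
    residual = y - base_prob
    correction = np.zeros(n_regions)
    for r in range(n_regions):
        mask = regions == r
        if mask.any():
            correction[r] = residual[mask].mean()
    return correction


# Step 3: learn one correction value per region from training residuals.
base_train = forest.predict_proba(X_train)[:, 1]
correction = fit_correction(X_train, y_train, base_train)


def calibrated_prob(X):
    """Shift the ensemble's probability by the correction of the sample's region."""
    base = forest.predict_proba(X)[:, 1]
    return np.clip(base + correction[refined_rule(X)], 0.0, 1.0)
```

Calling `calibrated_prob(X_test)` returns corrected class-1 probabilities for held-out rows. On the training set, this per-region shift cannot increase the squared error relative to the uncorrected ensemble, but whether it helps on held-out data depends on how well the refined rule partitions the samples.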
