LLM Embeddings for Deep Learning on Tabular Data
Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Blaž Škrlj, Nikola Simidjievski, Mateja Jamnik
TL;DR
The paper addresses the encoding of heterogeneous tabular data for deep learning. It transforms feature-value interactions into text and embeds them with a frozen LLM to form $E(X) \in \mathbb{R}^{N \times M \times d}$, which is passed through a light adapter to a downstream tabular model. The method is model-agnostic and improves the performance of ResNet, MLP, and FT-Transformer across seven datasets, with $d=1024$ and the LLMs kept frozen. Gains over the base DL models are consistent, particularly on datasets with many categorical features, and larger LLMs can yield bigger improvements. Tree-based ensembles still outperform the neural baselines in some cases, though the gap narrows. Limitations include the computational cost of per-feature LLM queries and potential misalignment when feature descriptions are sparse. Future work aims at cross-table training and at leveraging knowledge bases to strengthen feature-to-value interactions.
Abstract
Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods deal with this heterogeneous nature of tabular data by employing separate type-specific encoding approaches. This limits the cross-table transfer potential and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution for improving deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet and FT-Transformer, by validating on seven classification datasets.
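To make the text-then-embed idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each feature-value pair is serialized with a simple "feature is value" template, that the embedding tensor has one row per sample and one slot per feature, and that the adapter is a single linear map. The names `serialize`, `fake_frozen_embed`, `embed_table`, and `linear_adapter` are hypothetical. The frozen LLM is replaced by a deterministic hash-seeded stand-in so the example runs without any model download.

```python
import hashlib

import numpy as np

D = 1024  # embedding width, matching the d = 1024 reported in the TL;DR


def serialize(feature, value):
    # Hypothetical template; the paper's actual serialization may differ.
    return f"{feature} is {value}"


def fake_frozen_embed(text, d=D):
    # Stand-in for a frozen LLM encoder (assumption): a deterministic
    # pseudo-random vector seeded by a hash of the text.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "little")
    return np.random.default_rng(seed).standard_normal(d).astype(np.float32)


def embed_table(rows, feature_names, d=D):
    # Builds an array of shape (num_rows, num_features, d).
    # Identical serialized strings are embedded once and reused.
    cache = {}
    out = np.empty((len(rows), len(feature_names), d), dtype=np.float32)
    for i, row in enumerate(rows):
        for j, name in enumerate(feature_names):
            text = serialize(name, row[j])
            if text not in cache:
                cache[text] = fake_frozen_embed(text, d)
            out[i, j] = cache[text]
    return out


def linear_adapter(E, W, b):
    # Light adapter sketched as one linear map applied to every feature slot.
    return E @ W + b


rows = [(39, "Private", "Bachelors"), (50, "Self-emp", "Bachelors")]
features = ["age", "workclass", "education"]

E = embed_table(rows, features)
rng = np.random.default_rng(0)
W = (0.01 * rng.standard_normal((D, 64))).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
Z = linear_adapter(E, W, b)

print(E.shape, Z.shape)  # (2, 3, 1024) (2, 3, 64)
```

The cache is a design choice of this sketch: because the encoder is frozen, repeated feature-value strings (common for categorical features) only need to be embedded once. The adapter output `Z` is what a downstream model such as an MLP, ResNet, or FT-Transformer would consume.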
