LLM Embeddings for Deep Learning on Tabular Data

Boshko Koloski; Andrei Margeloiu; Xiangjian Jiang; Blaž Škrlj; Nikola Simidjievski; Mateja Jamnik

TL;DR

The paper addresses the problem of encoding heterogeneous tabular data for deep learning. It transforms feature-value interactions into text and uses frozen LLM embeddings to form $E(X) \in \mathbb{R}^{N \times M \times d}$, which is passed through a light adapter and then a downstream tabular model. The method is model-agnostic and improves the performance of MLP, ResNet, and FT-Transformer across seven datasets, with $d=1024$ and the LLMs kept frozen. Gains over the base deep-learning models are consistent, particularly on datasets with many categorical features, and larger LLMs can yield bigger improvements. Tree-based ensembles still outperform the neural models in some cases, though the gap narrows. Limitations include the computational cost of per-feature LLM queries and potential misalignment when feature descriptions are sparse. Future work aims at cross-table training and at leveraging knowledge bases to strengthen feature-to-value interactions.

Abstract

Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods deal with this heterogeneous nature of tabular data by employing separate type-specific encoding approaches. This limits the cross-table transfer potential and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution for improving deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet and FT-Transformer, by validating on seven classification datasets.
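To make the text-then-embed step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that N is the number of rows and M the number of features in $E(X)$, that each cell is serialized with a hypothetical "feature is value" template (the actual template is not stated here), and it replaces the frozen LLM with a deterministic hash-based stand-in called `toy_embed` so that the example runs without any model. The demo uses d=8 for brevity rather than the d=1024 reported above. The adapter and the downstream tabular model are omitted.

```python
import hashlib
import random


def serialize_cell(feature_name, value):
    # Hypothetical template: the exact wording used in the paper is an assumption.
    return f"{feature_name} is {value}"


def toy_embed(text, d):
    # Stand-in for a frozen LLM encoder (assumption): a deterministic
    # pseudo-random vector seeded by a hash of the text.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]


def embed_table(rows, feature_names, d):
    # Builds a nested list of shape N x M x d, one vector per cell.
    # Identical (feature, value) texts are embedded once and reused.
    cache = {}
    table = []
    for row in rows:
        row_embeddings = []
        for name, value in zip(feature_names, row):
            text = serialize_cell(name, value)
            if text not in cache:
                cache[text] = toy_embed(text, d)
            row_embeddings.append(cache[text])
        table.append(row_embeddings)
    return table


feature_names = ["age", "occupation"]
rows = [[39, "teacher"], [52, "engineer"], [39, "engineer"]]
E = embed_table(rows, feature_names, d=8)
shape = (len(E), len(E[0]), len(E[0][0]))
print(shape)  # (3, 2, 8)
```

Caching by serialized text is a design choice of this sketch: because the encoder is frozen, a repeated feature-value pair always maps to the same vector, so it only needs to be queried once. Whether the paper uses such a cache is not stated here.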
