LLM Embeddings for Deep Learning on Tabular Data

Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Blaž Škrlj, Nikola Simidjievski, Mateja Jamnik

TL;DR

The paper addresses the problem of encoding heterogeneous tabular data for deep learning. It transforms feature-value interactions into text and uses frozen LLM embeddings to form $E(X) \in \mathbb{R}^{N \times M \times d}$, which is passed through a lightweight adapter and then a downstream tabular model. The method is model-agnostic and improves the performance of ResNet, MLP, and FT-Transformer across seven datasets, with $d = 1024$ and the LLMs kept frozen. Gains over the base DL models are consistent, particularly on datasets with many categorical features, and larger LLMs can yield bigger improvements. Tree-based ensembles still outperform neural baselines in some cases, though the gap narrows. Limitations include the computational cost of per-feature LLM queries and potential misalignment when feature descriptions are sparse. Future work aims at cross-table training and at leveraging knowledge bases to strengthen feature-to-value interactions.
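
To make the shape of $E(X)$ concrete, here is a minimal sketch of the serialise-then-embed step, reading $N$ as the number of rows and $M$ as the number of features (an assumption about the notation). The sentence template `"<name> is <value>"`, the function names, the made-up rows, and the toy width `D = 8` are all illustrative. `toy_embed` is a hypothetical stand-in for a frozen LLM encoder, not the model used in the paper: it only returns a deterministic vector per sentence so that the example runs with the standard library alone.

```python
import hashlib
import random

D = 8  # toy embedding width for illustration; the summary above reports d = 1024


def serialise(row: dict) -> list[str]:
    """Turn one table row into one sentence per feature (template is an assumption)."""
    return [f"{name} is {value}" for name, value in row.items()]


def toy_embed(sentence: str, d: int = D) -> list[float]:
    """Hypothetical stand-in for a frozen LLM encoder.

    Seeds a pseudo-random generator with a hash of the sentence, so the same
    sentence always maps to the same vector, as it would with a frozen encoder.
    """
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]


def embed_table(rows: list[dict]) -> list[list[list[float]]]:
    """Build E(X) as nested lists indexed [row][feature][embedding dimension]."""
    return [[toy_embed(sentence) for sentence in serialise(row)] for row in rows]


# Made-up rows, purely for illustration.
rows = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    {"age": 50, "workclass": "Private", "education": "HS-grad"},
]
E = embed_table(rows)
shape = (len(E), len(E[0]), len(E[0][0]))
print(shape)  # (2, 3, 8), i.e. N x M x d
```

Because the encoder is frozen, a given sentence always maps to the same vector, so repeated feature-value pairs could in principle be embedded once and cached. In the described pipeline, the resulting tensor would then go through the trainable adapter before reaching the downstream model.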

Abstract

Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods deal with this heterogeneous nature of tabular data by employing separate type-specific encoding approaches. This limits the cross-table transfer potential and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution to improving deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet and FT-Transformer, by validating on seven classification datasets.

Paper Structure

This paper contains 13 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Schema of our proposed methodology. (1) The input is first serialised, feature by feature, into sentences. (2) Large Language Models (LLMs) are used to extract embeddings of the inputs. (3) We project and adapt the embeddings with an MLP. (4) We apply trainable models that utilise the LLM embeddings for feature encoding.
  • Figure 2: Comparing relative test performance of base models and their LLM-enhanced variants. Using LLMs generally improves performance, with BGE showing the most consistent improvements.
  • Figure 3: Projection of the embedded features with the 'BGE' model. For demonstration purposes we show at most 20 randomly selected unique values.
  • Figure 4: Hierarchical Bayesian t-test assessing the probability that the LLM-based embeddings outperform the base embeddings across models, 7 datasets, and 10 random seeds.