Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

Oksana Kolomenko, Ricardo Knauer, Erik Rodner

Abstract

Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings with the original columns tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor indicators of performance. Finally, gradient-boosted decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.

Paper Structure (14 sections, 9 figures, 2 tables)

This paper contains 14 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of our work. We serialize tabular rows into text and use embedding models to convert them into numerical representations. These embeddings are preprocessed and serve as inputs for downstream models. We systematically evaluate 256 pipeline configurations, i.e., 16 embedding models, 8 preprocessing strategies, and 2 downstream models.
  • Figure 2: Top-20 mean embedding pipeline performance. Baselines are shown in light blue.
  • Figure 3: Correlation matrix heatmap between the predictive performance and embedding model attributes, based on the Spearman rank correlation.
  • Figure 4: Mean AUC gains by concatenating text embeddings instead of replacing the original columns with text embeddings.
  • Figure 5: Mean AUC gains by applying dimensionality reduction via PCA to text embeddings.
  • ...and 4 more figures
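
The captions of Figures 1 and 4 describe serializing rows into text, embedding the text, and then either replacing the original columns with the embedding or concatenating the two. The following is a minimal sketch of that distinction, not the authors' implementation. The serialization template ("column is value"), the example rows, and every function name are assumptions made for illustration, and `toy_embed` is a hypothetical hash-based stand-in that carries no world knowledge; in practice it would be swapped for a real embedding model.

```python
import hashlib
from typing import Dict, List, Sequence, Union

Value = Union[int, float, str]
Row = Dict[str, Value]


def serialize_row(row: Row) -> str:
    """Turn one tabular row into text (assumed template: 'column is value')."""
    return ", ".join(f"{name} is {value}" for name, value in row.items())


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model.

    Maps text to `dim` deterministic floats in [0, 1) using SHA-256 bytes.
    It only mimics the interface (text in, fixed-length vector out).
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]


def build_features(
    rows: Sequence[Row],
    numeric_columns: Sequence[str],
    mode: str,
    dim: int = 8,
) -> List[List[float]]:
    """Build a feature matrix from rows.

    mode="replace":     features are the text embedding only.
    mode="concatenate": features are the original numeric columns
                        followed by the text embedding.
    """
    features = []
    for row in rows:
        embedding = toy_embed(serialize_row(row), dim)
        if mode == "replace":
            features.append(embedding)
        elif mode == "concatenate":
            original = [float(row[column]) for column in numeric_columns]
            features.append(original + embedding)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return features


# Made-up example rows, for illustration only.
rows = [
    {"age": 54, "sex": "female", "blood_pressure": 130},
    {"age": 37, "sex": "male", "blood_pressure": 118},
]
numeric_columns = ["age", "blood_pressure"]

replaced = build_features(rows, numeric_columns, mode="replace")
concatenated = build_features(rows, numeric_columns, mode="concatenate")

print(serialize_row(rows[0]))
print(len(replaced[0]), len(concatenated[0]))
```

Either feature matrix could then be passed to a downstream model. With the concatenate mode the downstream model still sees the raw numeric values, so the embedding can only add information rather than stand in for columns it may encode poorly.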