Understanding LLM Embeddings for Regression

Eric Tang, Bangding Yang, Xingyou Song

TL;DR

The paper investigates regression with fixed LLM embeddings as downstream features, in contrast to traditional feature engineering and decoding-based methods. It formalizes the regression task with a consistent downstream head and evaluates on both synthetic (BBOB) and real-world (Vizier) tasks, comparing LLM embeddings ($\phi_{\text{LLM}}$) with traditional feature representations ($\phi_{\text{trad}}$). A key contribution is the Normalized Lipschitz Factor Distribution (NLFD), which links embedding smoothness to regression performance. The results show that LLM embeddings often retain performance in settings with many degrees of freedom (DOF), and that model size has nuanced effects rather than uniformly improving results. The findings suggest practical implications for embedding-based regression on high-dimensional, non-tabular inputs, and the paper acknowledges limitations and future directions in non-tabular modalities and multi-modal data.

Abstract

With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part due to LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
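
For reference, the Lipschitz property the abstract appeals to can be written in its standard textbook form. The symbols here are illustrative assumptions rather than the paper's own equation: $f$ is the regression target, $\phi$ is an embedding of the input, and $L$ is a constant.

```latex
|f(x) - f(x')| \;\le\; L \,\lVert \phi(x) - \phi(x') \rVert
\quad \text{for all inputs } x, x'
```

Read this way, a smaller $L$ means that inputs close together in embedding space have similar target values, which is the sense in which a smoother embedding should make regression easier.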

Paper Structure

This paper contains 23 sections, 1 equation, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Rugged surface of a 5D Sphere function when inputs are represented as Gemini embeddings of dimension 6K+, post-processed by t-SNE into 2D space.
  • Figure 2: Higher ($\uparrow$) is better. Degrees of freedom (DOF) vs Kendall-Tau correlation for various BBOB functions. Results are averaged over 12 runs for each regression method. Each task's data consists of 500 $(x,y)$ evaluations sampled uniformly across the input space, using an 8-1-1 split for train-validation-test.
  • Figure 3: Left-skewness ($\leftarrow$) is better. NLFDs induced by $\phi_{\text{LLM}}$ (T5-XXL) and $\phi_{\text{trad}}$. Top: Cases where $\phi_{\text{LLM}}$ outperforms $\phi_{\text{trad}}$ for regression. Bottom: Vice-versa where $\phi_{\text{trad}}$ outperforms $\phi_{\text{LLM}}$.
  • Figure 4: Relationship between gaps in NLFD (via Z-score) and regression performance for all 23 BBOB functions. Relationship is quantified using (K, S, P), which respectively are Kendall-Tau, Spearman and Pearson correlations. Top: We vary model size within the T5 model family. Bottom: We vary the objective's DOF for Gemini Pro.
  • Figure 5: t-SNE for Gemini (Nano and Pro) embeddings of points sampled around a DOF=100 reference point. Traditional $\ell_{2}$ distance is overlaid in color.
  • ...and 9 more figures
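
The evaluation protocol in the Figure 2 caption (500 uniformly sampled points, an 8-1-1 train-validation-test split, Kendall-Tau on the test split, averaged over 12 runs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: the objective is an unshifted Sphere function, the domain $[-5, 5]^{\text{DOF}}$ is assumed, raw coordinates stand in for an embedding, and a k-nearest-neighbor average stands in for the regression head. All function names are hypothetical.

```python
import random


def sphere(x):
    """Unshifted Sphere objective (assumption: no BBOB shift or offset applied)."""
    return sum(v * v for v in x)


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def neighbor_targets(train_x, train_y, query):
    """Training targets ordered by squared L2 distance from `query`."""
    ranked = sorted(
        (sum((p - q) ** 2 for p, q in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    return [y for _, y in ranked]


def knn_predictions(train_x, train_y, queries, k):
    """Stand-in regression head: mean target of the k nearest training points."""
    return [sum(neighbor_targets(train_x, train_y, q)[:k]) / k for q in queries]


def run_trial(dof=5, n=500, seed=0, k_grid=(1, 3, 5, 10)):
    rng = random.Random(seed)
    # Uniform samples over [-5, 5]^dof (assumed search domain).
    xs = [[rng.uniform(-5.0, 5.0) for _ in range(dof)] for _ in range(n)]
    ys = [sphere(x) for x in xs]

    # 8-1-1 split; samples are i.i.d., so slicing by index is already random.
    n_train, n_val = n * 8 // 10, n // 10
    tr_x, tr_y = xs[:n_train], ys[:n_train]
    va_x, va_y = xs[n_train:n_train + n_val], ys[n_train:n_train + n_val]
    te_x, te_y = xs[n_train + n_val:], ys[n_train + n_val:]

    # Pick k on the validation split, then score once on the test split.
    best_k = max(
        k_grid,
        key=lambda k: kendall_tau(knn_predictions(tr_x, tr_y, va_x, k), va_y),
    )
    test_tau = kendall_tau(knn_predictions(tr_x, tr_y, te_x, best_k), te_y)
    return {"sizes": (len(tr_x), len(va_x), len(te_x)), "k": best_k, "tau": test_tau}


if __name__ == "__main__":
    # Average over 12 seeds, mirroring the 12 runs mentioned in the caption.
    taus = [run_trial(seed=s)["tau"] for s in range(12)]
    print(f"mean Kendall-Tau over 12 runs: {sum(taus) / len(taus):.3f}")
```

To compare representations in this setup, one would replace the raw coordinates passed to the head with the output of an embedding function while keeping the split and the metric fixed. In practice `scipy.stats.kendalltau` is the usual choice for the metric; the hand-written tau-a above only keeps the sketch dependency-free.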