Understanding LLM Embeddings for Regression
Eric Tang, Bangding Yang, Xingyou Song
TL;DR
The paper investigates regression that uses fixed LLM embeddings as downstream features, in contrast to traditional feature engineering and decoding-based methods. It formalizes the regression task with a consistent downstream head and evaluates on both synthetic (BBOB) and real-world (Vizier) tasks, comparing LLM embeddings ($d_{llm}$) against traditional feature embeddings ($d_{trad}$). A key contribution is the Normalized Lipschitz Factor Distribution (NLFD), which links embedding smoothness to regression performance. The results show that LLM embeddings often retain performance in high degree-of-freedom (DOF) settings and that the effect of model size is nuanced rather than uniformly positive. The findings have practical implications for embedding-based regression over high-dimensional, non-tabular inputs; the paper also acknowledges limitations and points to future work on non-tabular modalities and multi-modal data.
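To make the link between embedding smoothness and regression concrete, the following is a minimal sketch of a Lipschitz-factor computation, not the paper's exact NLFD definition. It assumes each point is paired with its nearest neighbor in embedding space and that the factors are normalized by dividing by their mean; both choices, along with the function names `lipschitz_factors` and `normalized_lipschitz_factors`, are illustrative assumptions.

```python
import numpy as np


def lipschitz_factors(embeddings, y, eps=1e-12):
    """Per-point ratio |y_i - y_j| / ||e_i - e_j||, where j is the nearest
    neighbor of i in embedding space (an assumed pairing rule)."""
    embeddings = np.asarray(embeddings, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise Euclidean distances between all embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude each point from being its own neighbor.
    np.fill_diagonal(dist, np.inf)
    nearest = np.argmin(dist, axis=1)
    idx = np.arange(len(y))
    # eps guards against duplicate embeddings (zero distance).
    return np.abs(y - y[nearest]) / np.maximum(dist[idx, nearest], eps)


def normalized_lipschitz_factors(embeddings, y):
    """Divide by the mean so the distribution is scale-invariant
    (an assumed normalization, chosen for illustration)."""
    factors = lipschitz_factors(embeddings, y)
    mean = factors.mean()
    return factors / mean if mean > 0 else factors
```

Under these assumptions, a distribution concentrated at small values means that nearby embeddings tend to carry similar targets, which is the kind of smoothness a downstream regression head can exploit.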
Abstract
With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than traditional feature engineering. This regression performance can be explained in part by LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which, surprisingly, do not always improve regression performance.
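The pipeline described in the abstract (serialize the input as a string, embed it with a frozen model, fit a regression head on the embeddings) can be sketched as follows. This is an illustration under stated assumptions only: `toy_embed` is a hypothetical hashed-trigram stand-in and not an LLM, the `serialize` string format is invented for the example, and a closed-form ridge regressor is swapped in for whatever downstream head the paper trains.

```python
import hashlib

import numpy as np


def serialize(x):
    """Render a numeric input as a string (hypothetical format)."""
    return ", ".join(f"x{i}:{v:.3f}" for i, v in enumerate(x))


def toy_embed(text, dim=64):
    """Placeholder for a frozen LLM embedder: hashed character trigrams,
    L2-normalized. Deterministic, and never updated during training."""
    vec = np.zeros(dim)
    padded = f"  {text}  "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3].encode("utf-8")
        bucket = int.from_bytes(hashlib.md5(gram).digest()[:4], "little") % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def fit_ridge(features, targets, l2=1e-3):
    """Closed-form ridge head; a bias column is appended (and regularized)."""
    f = np.hstack([features, np.ones((len(features), 1))])
    a = f.T @ f + l2 * np.eye(f.shape[1])
    return np.linalg.solve(a, f.T @ targets)


def predict(weights, features):
    f = np.hstack([features, np.ones((len(features), 1))])
    return f @ weights


# Usage: only the head is fitted; the embedder stays fixed.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(32, 4))
ys = (xs ** 2).sum(axis=1)  # a simple sphere-like objective
feats = np.stack([toy_embed(serialize(x)) for x in xs])
weights = fit_ridge(feats, ys)
preds = predict(weights, feats)
```

Replacing `toy_embed` with a real embedding model while keeping the same head is what would let different embeddings be compared on equal footing.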
