From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Robert Vacareanu; Vlad-Andrei Negru; Vasile Suciu; Mihai Surdeanu

TL;DR

The paper investigates whether pre-trained large language models can perform both linear and non-linear regression purely through in-context exemplars, without gradient updates. Using synthetic regression datasets, Friedman benchmarks, and symbolic-input tasks, it compares 12 LLMs against traditional supervised and unsupervised baselines, revealing that LLMs often rival or surpass classic methods, especially in linear tasks, and extend to non-linear regimes with notable success. It also analyzes how performance scales with the number of in-context examples, showing sub-linear regret for strong models and gradual adaptation toward hindsight-optimal decisions. The work highlights the potential of LLMs as general-purpose regressors under ICL, discusses data contamination and ethical considerations, and provides open-source code to reproduce and extend the findings.
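To make "regression purely through in-context exemplars" concrete, the sketch below shows one way labelled examples and a single unanswered query could be serialized into a text prompt. The function name `build_regression_prompt`, the "Feature i" / "Output" labels, and the two-decimal rounding are illustrative assumptions, not the template used in the paper.

```python
def build_regression_prompt(examples, query, feature_names=None):
    """Format (features, target) pairs plus one unanswered query as plain text.

    The layout (one feature per line, then an "Output:" line) is a hypothetical
    choice for illustration; the paper's actual prompt may differ.
    """
    if feature_names is None:
        feature_names = [f"Feature {i}" for i in range(len(query))]
    lines = []
    # In-context exemplars: inputs followed by the known target value.
    for features, target in examples:
        for name, value in zip(feature_names, features):
            lines.append(f"{name}: {value:.2f}")
        lines.append(f"Output: {target:.2f}")
        lines.append("")
    # The query: same layout, but the target is left for the model to complete.
    for name, value in zip(feature_names, query):
        lines.append(f"{name}: {value:.2f}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [((1.0, 2.0), 3.5), ((0.5, -1.0), 1.25)]
prompt = build_regression_prompt(examples, query=(2.0, 0.0))
print(prompt)
```

The model's text completion after the final "Output:" would then be parsed as a number and scored against the true target; no model weights are updated at any point.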

Abstract

We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret.
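As a rough illustration of the regret notion mentioned above: in online learning, cumulative regret after T steps is the learner's total loss minus the total loss of the best comparator chosen in hindsight, and it is sub-linear when regret divided by T shrinks toward zero. The sketch below computes this from per-step losses. The comparator names (`baseline_a`, `baseline_b`) and the loss values are made up for illustration; the comparator set used in the paper is not specified in this section.

```python
from itertools import accumulate


def cumulative_regret(learner_losses, comparator_losses):
    """Regret after each step t against the best comparator in hindsight.

    learner_losses: per-step losses of the learner (e.g. absolute errors).
    comparator_losses: dict mapping a comparator name to its per-step losses.
    """
    learner_cum = list(accumulate(learner_losses))
    comparator_cums = [list(accumulate(l)) for l in comparator_losses.values()]
    return [
        learner_cum[t] - min(c[t] for c in comparator_cums)
        for t in range(len(learner_cum))
    ]


# Hypothetical per-step losses: the learner improves as it sees more examples.
learner = [4.0, 2.0, 1.0, 0.5]
comparators = {
    "baseline_a": [3.0, 3.0, 3.0, 3.0],
    "baseline_b": [1.0, 1.0, 1.0, 1.0],
}
regret = cumulative_regret(learner, comparators)
# Sub-linear regret shows up as a decreasing average regret per step.
average_regret = [r / (t + 1) for t, r in enumerate(regret)]
print(regret)
print(average_regret)
```

In this toy run the cumulative regret flattens out while the per-step average keeps falling, which is the qualitative pattern that "sub-linear regret" describes.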

Paper Structure (55 sections, 20 equations, 48 figures, 14 tables)

Introduction
Experimental Setup
Datasets
Linear Regression Datasets
Non-Linear Regression Datasets
Regression With Non-Numerical Inputs
Models
LLMs:
Supervised Baselines:
Unsupervised Baselines:
Large Language Models Can Do Linear Regression
Large Language Models Can Do Non-Linear Regression
Friedman Benchmarks
New Regression Datasets
Discussion
...and 40 more sections

Figures (48)

Figure 1: Mean Absolute Error ($\downarrow$) comparison between three large language models (LLMs) and four traditional supervised methods for learning a linear regression function with one informative variable out of two. Given only in-context examples and without any additional training or gradient updates, pre-trained LLMs such as Claude 3, GPT-4, or DBRX can outperform supervised methods such as Random Forest or Gradient Boosting.
Figure 2: The performance, as measured by the Mean Absolute Error ($\downarrow$), across large language models (LLM), traditional supervised models and unsupervised models on two different random regression tasks: (a) sparse linear regression, where only 1 out of a total of 3 variables is informative, and (b) linear regression with two informative variables. The results are averages with 95% confidence intervals from 100 runs with varied random seeds. All LLMs perform better than the unsupervised models, suggesting a more sophisticated underlying mechanism at play in ICL. Furthermore, some LLMs (e.g., Claude 3) even outperform traditional supervised methods such as Random Forest or Gradient Boosting.
Figure 3: The rank of each method investigated over all four linear regression datasets. Rankings are visually encoded with a color gradient, where green means better performance (higher ranks) and red indicates worse performance (lower ranks). Notably, very strong LLMs such as Claude 3 and GPT-4 consistently outperform traditional supervised methods such as Gradient Boosting, Random Forest, or KNN. (best viewed in color)
Figure 4: The performance of large language models (LLM), traditional supervised models and unsupervised models on Friedman #1, #2, and #3. The results represent the averages with 95% confidence intervals over 100 different runs.
Figure 5: An example of one of our new non-linear regression functions. The function was designed to mimic a linear trend with oscillations.
...and 43 more figures
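For readers unfamiliar with the Friedman #1, #2, and #3 benchmarks named in the Figure 4 caption, the sketch below writes out their standard noise-free target functions, following the definitions documented for scikit-learn's `make_friedman1`, `make_friedman2`, and `make_friedman3`. The helper `sample_friedman2_input` and the choice to omit noise are assumptions for illustration; the paper's exact sampling and noise settings are not stated in this section.

```python
import math
import random


def friedman1(x):
    """Friedman #1: uses the first five features, each in [0, 1]."""
    return (10 * math.sin(math.pi * x[0] * x[1])
            + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3]
            + 5 * x[4])


def friedman2(x):
    """Friedman #2: four features with very different ranges."""
    return math.sqrt(x[0] ** 2 + (x[1] * x[2] - 1 / (x[1] * x[3])) ** 2)


def friedman3(x):
    """Friedman #3: same inputs as #2, arctangent of a ratio (x[0] must be nonzero)."""
    return math.atan((x[1] * x[2] - 1 / (x[1] * x[3])) / x[0])


def sample_friedman2_input(rng):
    """Draw one input from the standard ranges for Friedman #2 and #3 (hypothetical helper)."""
    return [
        rng.uniform(0, 100),
        rng.uniform(40 * math.pi, 560 * math.pi),
        rng.uniform(0, 1),
        rng.uniform(1, 11),
    ]


rng = random.Random(0)
x = sample_friedman2_input(rng)
print(x, friedman2(x))
```

The widely differing input scales in Friedman #2 are part of what makes it a demanding test for any regressor that sees the features only as text.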
