In-Context Function Learning in Large Language Models

Elif Akata, Konstantinos Voudouris, Vincent Fortuin, Eric Schulz

TL;DR

The paper investigates how large language models perform in-context learning (ICL) on continuous function tasks by casting ICL as Gaussian Process (GP) regression with known priors. It proposes a principled evaluation framework that uses GP regression as a lower bound on prediction error and a 1-nearest-neighbor (1-NN) rule as an upper bound, analyzes inductive biases via likelihood comparisons across kernels, and tests parameter-efficient post-training (SFT and GRPO) to steer these priors. Key findings: learning curves are GP-like and improve with model size; kernel smoothness shapes learning speed; and the models show a bias toward rough functions in low dimensions that shifts toward smoother functions in higher dimensions. Post-training can steer these priors toward the structure of the training data, with GRPO generalizing better than SFT. The work provides a quantitative framework for understanding and steering in-context function learning in LLMs, with implications for data-efficient continuous-function tasks and model alignment.

Abstract

Large language models (LLMs) can learn from a few demonstrations provided at inference time. We study this in-context learning phenomenon through the lens of Gaussian Processes (GPs). We build controlled experiments where models observe sequences of multivariate scalar-valued function samples drawn from known GP priors. We evaluate prediction error in relation to the number of demonstrations and compare against two principled references: (i) an empirical GP-regression learner that gives a lower bound on achievable error, and (ii) the expected error of a 1-nearest-neighbor (1-NN) rule, which gives a data-driven upper bound. Across model sizes, we find that LLM learning curves are strongly influenced by the function-generating kernels and approach the GP lower bound as the number of demonstrations increases. We then study the inductive biases of these models using a likelihood-based analysis. We find that LLM predictions are most likely under less smooth GP kernels. Finally, we explore whether post-training can shift these inductive biases and improve sample-efficiency on functions sampled from GPs with smoother kernels. We find that both reinforcement learning and supervised fine-tuning can effectively shift inductive biases in the direction of the training data. Together, our framework quantifies the extent to which LLMs behave like GP learners and provides tools for steering their inductive biases for continuous function learning tasks.
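
The two reference learners named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the input range, lengthscale, jitter, and all function names are assumptions chosen for illustration, and the 1-NN curve here is an empirical average rather than the expected error used in the paper.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Squared Exponential kernel between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, lengthscale=1.0, jitter=1e-6):
    """Posterior mean of a zero-mean GP with an SE kernel at x_test."""
    K = se_kernel(x_train, x_train, lengthscale) + jitter * np.eye(len(x_train))
    k_star = se_kernel(x_test, x_train, lengthscale)
    return k_star @ np.linalg.solve(K, y_train)

def one_nn_predict(x_train, y_train, x_test):
    """Copy the target of the nearest demonstration for each test input."""
    idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def learning_curves(n_functions=200, n_points=20, lengthscale=1.0, seed=0):
    """Mean absolute error after n demonstrations, for n = 1..n_points-1."""
    rng = np.random.default_rng(seed)
    gp_err = np.zeros(n_points - 1)
    nn_err = np.zeros(n_points - 1)
    for _ in range(n_functions):
        # Draw one function from the GP prior at random inputs (assumed range).
        x = rng.uniform(0.0, 10.0, size=n_points)
        K = se_kernel(x, x, lengthscale) + 1e-6 * np.eye(n_points)
        y = np.linalg.cholesky(K) @ rng.standard_normal(n_points)
        for n in range(1, n_points):
            # The first n points are demonstrations; point n is the query.
            x_q, y_q = x[n:n + 1], y[n]
            gp_err[n - 1] += abs(gp_predict(x[:n], y[:n], x_q, lengthscale)[0] - y_q)
            nn_err[n - 1] += abs(one_nn_predict(x[:n], y[:n], x_q)[0] - y_q)
    return gp_err / n_functions, nn_err / n_functions

if __name__ == "__main__":
    gp_curve, nn_curve = learning_curves()
    for n in (1, 5, 10, 19):
        print(f"n={n:2d}  GP MAE={gp_curve[n - 1]:.3f}  1-NN MAE={nn_curve[n - 1]:.3f}")
```

An LLM's learning curve would be computed the same way, with the model's prediction for the query point substituted for `gp_predict`, and then compared against the two reference curves.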

Paper Structure (25 sections, 7 equations, 6 figures)

Figures (6)

  • Figure 1: Overview of our framework. (a) In-context function learning in base models: a large language model (LLM) receives demonstrations from functions sampled from known Gaussian-process (GP) priors and predicts $f(\mathbf{X})$ at a new $\mathbf{X}$. (b) Post-training: the model is fine-tuned (SFT or GRPO) and re-evaluated to measure changes in learning curves. (c) Inductive-bias analysis: a likelihood comparison identifies the GP kernels that best explain model predictions before and after training.
  • Figure 2: Learning curve analysis on 1-dimensional functions. The mean absolute error after $n$ demonstrations by function type. Left Four: Qwen-3-8B learning curves for functions drawn from four kernels, compared to the error of a GP regression and the expected error of a 1-nearest-neighbor rule. The LLM learning curves generally approach the GP regression baseline and are well below the 1-NN rule. Right Four: Model size comparisons between the 8B, 14B, and 32B Qwen-3 models, on identical data. The 14B and 32B models show noticeably lower error rates than the 8B model, but do not differ significantly from each other, suggesting a logarithmic scaling law. All LLM and GP learning curves are shown with 95% bootstrapped confidence intervals.
  • Figure 3: Qwen-3-14B learning curve comparison for 1-, 2-, 3-, and 4-dimensional data drawn from the Squared Exponential kernel with $\lambda=8$. The mean absolute error after $n$ demonstrations by function type. All LLM and GP learning curves are shown with 95% bootstrapped confidence intervals.
  • Figure 4: Inductive bias analysis of the base models (8B, 14B, 32B). The average likelihood per prediction is computed under GPs with four different kernels ($\ell=8$), on 1-dimensional data drawn from either the Squared Exponential or the Matérn $\frac{1}{2}$. These likelihoods are presented on a symmetric log scale. LLM predictions for all model sizes are more likely under kernels with lower $\nu$, i.e., those that describe rougher, less predictable functions.
  • Figure 5: Inductive bias analysis of the 8B model at four checkpoints (1k, 2k, 5k, 10k steps), compared to the base model. Average likelihood per prediction shown on a symlog scale.
  • ...and 1 more figures
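
The likelihood comparison described in the Figure 4 caption can be sketched as follows: score a model's prediction by its density under the GP posterior predictive for each candidate kernel, then compare the scores across kernels. This is a hedged illustration, not the paper's implementation. The choice of Matérn $\frac{3}{2}$ and $\frac{5}{2}$ as the two intermediate kernels, the unit prior variance, the jitter, the use of log densities, and all function names are assumptions.

```python
import numpy as np

def matern_kernel(a, b, nu, lengthscale=1.0):
    """Matern kernel on 1-D inputs; nu = np.inf gives the Squared Exponential."""
    r = np.abs(a[:, None] - b[None, :]) / lengthscale
    if nu == 0.5:
        return np.exp(-r)
    if nu == 1.5:
        s = np.sqrt(3.0) * r
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r
        return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
    if nu == np.inf:
        return np.exp(-0.5 * r ** 2)
    raise ValueError("nu must be one of 0.5, 1.5, 2.5, np.inf")

def predictive_log_density(x_train, y_train, x_q, y_pred, nu,
                           lengthscale=1.0, jitter=1e-6):
    """Log density of y_pred under the GP posterior predictive at x_q."""
    n = len(x_train)
    K = matern_kernel(x_train, x_train, nu, lengthscale) + jitter * np.eye(n)
    k_star = matern_kernel(np.array([x_q]), x_train, nu, lengthscale)  # (1, n)
    mean = (k_star @ np.linalg.solve(K, y_train))[0]
    var = 1.0 + jitter - (k_star @ np.linalg.solve(K, k_star.T))[0, 0]
    var = max(var, jitter)  # guard against round-off below the jitter floor
    return -0.5 * (np.log(2.0 * np.pi * var) + (y_pred - mean) ** 2 / var)

def score_kernels(x_train, y_train, x_q, y_pred, lengthscale=1.0):
    """Map each candidate smoothness nu to the log density of the prediction."""
    return {nu: predictive_log_density(x_train, y_train, x_q, y_pred, nu, lengthscale)
            for nu in (0.5, 1.5, 2.5, np.inf)}

if __name__ == "__main__":
    x_demo = np.array([0.0, 1.0, 2.0, 3.0])
    y_demo = np.array([0.1, 0.4, 0.2, -0.3])
    # y_pred stands in for an LLM's numeric prediction at x_q (hypothetical value).
    for nu, logp in score_kernels(x_demo, y_demo, x_q=1.5, y_pred=0.35).items():
        print(f"nu={nu}: log density = {logp:.3f}")
```

Averaging these scores over many prompts and query points gives a per-kernel summary comparable in spirit to the per-prediction averages the captions describe.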