Learning vs Retrieval: The Role of In-Context Examples in Regression with Large Language Models

Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi

TL;DR

The paper investigates how in-context learning (ICL) operates in large language models on regression tasks, arguing that ICL lies on a spectrum between meta-learning from in-context examples and retrieval of internal knowledge. It introduces an evaluation framework using three prompt configurations (Named Features, Anonymized Features, Randomized Ground Truth) across three datasets and three LLMs to quantify learning versus retrieval. Key findings show that combining explicit feature names with in-context examples lets knowledge retrieval and learning cooperate, improving data efficiency in low-data regimes, while excessive in-context data or random ground-truth prompts shift the balance toward learning or degrade performance. The study offers practical prompt-engineering guidance, demonstrates data-contamination risks, and outlines directions for robust ICL deployment in real-world regression tasks.

Abstract

Generative Large Language Models (LLMs) are capable of being in-context learners. However, the underlying mechanism of in-context learning (ICL) is still a major research question, and experimental research results about how models exploit ICL are not always consistent. In this work, we propose a framework for evaluating in-context learning mechanisms, which we claim are a combination of retrieving internal knowledge and learning from in-context examples by focusing on regression tasks. First, we show that LLMs can solve real-world regression problems and then design experiments to measure the extent to which the LLM retrieves its internal knowledge versus learning from in-context examples. We argue that this process lies on a spectrum between these two extremes. We provide an in-depth analysis of the degrees to which these mechanisms are triggered depending on various factors, such as prior knowledge about the tasks and the type and richness of the information provided by the in-context examples. We employ three LLMs and utilize multiple datasets to corroborate the robustness of our findings. Our results shed light on how to engineer prompts to leverage meta-learning from in-context examples and foster knowledge retrieval depending on the problem being addressed.

Paper Structure

This paper contains 39 sections, 1 equation, and 25 figures.

Figures (25)

  • Figure 1: The three main prompt configurations. In configuration a), the actual names of the features and the output are known, and the LLM is asked to guess the "price of a used Toyota or Maserati in 2019". Configuration b) is similar to a) except that the feature names are anonymized; here, the LLM is asked to estimate the "Output". In configuration c), the real prices of the in-context examples are replaced with randomly generated numbers drawn from a Gaussian distribution.
  • Figure 2: Baseline results (Direct QA configuration) across datasets and number of features. The dashed red line shows the performance of the Mean model.
  • Figure 3: Comprehensive comparison of prompt configurations' effects on our models across various factors, shown in a hierarchy. The top level for each dataset distinguishes between GPT-3, LLaMA 3, and GPT-4 results using black, grey, and white arcs, respectively. The notation $IC_i$ indicates the number of in-context examples, while F1, F2, and F3 represent the use of the first feature, the first two features, and all three features, respectively. The MSE scale of each dataset is shown at the top left corner.
  • Figure 4: Comparison of the number of in-context examples using the Named Features (solid lines) and Anonymized Features (dashed lines) prompt configurations. F1, F2, and F3 indicate using the first feature, the first two features, and all three features, respectively. The MSE scale of each dataset is shown at the top left corner.
  • Figure 5: Performance of Named Features and Anonymized Features prompt configurations, Ridge, and RandomForest for 3 features based on the number of in-context examples.
  • ...and 20 more figures
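
The three prompt configurations described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_prompt`, the `Name: value` line layout, the placeholder names `Feature 1`, `Feature 2`, ..., the example cars and prices, and the choice to draw the random targets from a Gaussian fitted to the real targets are all assumptions made here for illustration. Only the three configuration types and the anonymized target label "Output" come from the caption.

```python
import random

MODES = ("named", "anonymized", "randomized")


def build_prompt(examples, query, feature_names, target_name, mode="named", seed=0):
    """Build an in-context regression prompt (hypothetical layout).

    examples: list of (feature_values, target) pairs used as in-context examples.
    query: feature values of the instance whose target the LLM should predict.
    mode: "named" keeps real names, "anonymized" hides them,
          "randomized" keeps names but replaces targets with Gaussian noise.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    if mode == "anonymized":
        # Assumed placeholder naming; the caption only states the target is "Output".
        names = [f"Feature {i + 1}" for i in range(len(feature_names))]
        target = "Output"
    else:
        names, target = list(feature_names), target_name

    targets = [y for _, y in examples]
    if mode == "randomized":
        # Assumption: sample from a Gaussian with the mean/std of the real targets.
        rng = random.Random(seed)
        mean = sum(targets) / len(targets)
        std = (sum((y - mean) ** 2 for y in targets) / len(targets)) ** 0.5
        targets = [round(rng.gauss(mean, std), 2) for _ in targets]

    lines = []
    for (features, _), y in zip(examples, targets):
        lines += [f"{n}: {v}" for n, v in zip(names, features)]
        lines += [f"{target}: {y}", ""]  # blank line separates examples
    lines += [f"{n}: {v}" for n, v in zip(names, query)]
    lines.append(f"{target}:")  # left open for the model to complete
    return "\n".join(lines)


# Made-up data, for illustration only.
feature_names = ["Brand", "Year", "Mileage"]
examples = [(("Toyota", 2015, 60000), 14000), (("Maserati", 2017, 20000), 52000)]
query = ("Toyota", 2018, 30000)

for mode in MODES:
    print(f"--- {mode} ---")
    print(build_prompt(examples, query, feature_names, "Price", mode=mode))
```

Comparing a model's error under the named and anonymized prompts gives a rough handle on how much it relies on knowledge tied to the feature names, while the randomized prompt removes any learnable relation between features and targets in the examples.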