Probing the Decision Boundaries of In-context Learning in Large Language Models

Siyan Zhao, Tung Nguyen, Aditya Grover

TL;DR

This work studies in-context learning in large language models through the decision boundaries they produce on binary classification tasks. It finds that state-of-the-art LLMs often exhibit irregular, non-smooth boundaries even on simple linearly separable tasks, and that boundary shape is influenced by factors such as model size, prompt design, and data representation. The authors show that boundaries can be smoothed by fine-tuning early layers, fine-tuning on synthetic tasks, and uncertainty-aware active sampling of in-context examples; they also study transformers trained from scratch with TNP architectures. These findings offer practical guidance for improving the robustness and generalization of in-context learning and provide a diagnostic tool for understanding LLM inductive biases.

Abstract

In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.
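
The probing idea described in the abstract can be sketched as follows: fix a set of in-context examples, query the model at every point of a dense grid, and inspect the resulting label map. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt template, the labeling rule `x1 + x2 > 1`, and every function name are hypothetical, and `nearest_neighbor_predict` is only a stand-in for an actual LLM call.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Example = Tuple[Point, int]
Predictor = Callable[[List[Example], Point], int]


def make_linear_task(n: int, seed: int = 0) -> List[Example]:
    """Sample n points in [0, 1]^2, labeled by the assumed rule x1 + x2 > 1."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        examples.append(((x1, x2), int(x1 + x2 > 1.0)))
    return examples


def build_prompt(examples: List[Example], query: Point) -> str:
    """Serialize the examples and one query point (hypothetical template)."""
    blocks = [f"Input: {x1:.2f} {x2:.2f}\nLabel: {y}" for (x1, x2), y in examples]
    blocks.append(f"Input: {query[0]:.2f} {query[1]:.2f}\nLabel:")
    return "\n".join(blocks)


def nearest_neighbor_predict(examples: List[Example], query: Point) -> int:
    """Stand-in for an LLM call: return the label of the closest example."""
    def sq_dist(example: Example) -> float:
        (x1, x2), _ = example
        return (x1 - query[0]) ** 2 + (x2 - query[1]) ** 2

    return min(examples, key=sq_dist)[1]


def probe_boundary(
    examples: List[Example], predict: Predictor, resolution: int = 20
) -> List[List[int]]:
    """Query the predictor on a resolution x resolution grid over [0, 1]^2."""
    step = 1.0 / (resolution - 1)
    return [
        [predict(examples, (col * step, row * step)) for col in range(resolution)]
        for row in range(resolution)
    ]


if __name__ == "__main__":
    examples = make_linear_task(32)
    grid = probe_boundary(examples, nearest_neighbor_predict)
    # Print the label map with the x2 axis pointing up.
    for row in reversed(grid):
        print("".join("#" if label else "." for label in row))
```

To probe a real model, one would replace `nearest_neighbor_predict` with a function that sends `build_prompt(examples, query)` to the model and parses the predicted label from its output.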

Paper Structure (22 sections, 3 equations, 19 figures, 1 table)

Figures (19)

  • Figure 1: Decision boundaries of LLMs and traditional machine learning models on a linearly separable binary classification task. The background colors represent the model's predictions, while the points represent the in-context or training examples. LLMs exhibit non-smooth decision boundaries compared to the classical models. See Appendix \ref{appendix:traditional} for model hyperparameters.
  • Figure 2: Visualizations of decision boundaries for various LLMs, ranging in size from 1.3B to 13B, on a linearly separable binary classification task. The in-context data points are shown as scatter points and the colors indicate the label determined by each model. These decision boundaries are obtained using 128 in-context examples. The visualization highlights that the decision boundaries of these language models are not smooth.
  • Figure 3: Test accuracy for LLMs and baselines across three classification tasks (linear, circle, and moon), with each subplot illustrating the test accuracy as the number of in-context examples increases. The baselines are the SVM with a polynomial kernel and the MLP with two hidden layers. Shaded regions represent the standard error of the mean accuracy across 5 seeds.
  • Figure 4: Decision boundary of Llama2-7b with increasing in-context examples from 8 to 256.
  • Figure 5: The sensitivity of the Llama3-8b model's decision boundary to the order of in-context examples. Each subplot (Order 0 to Order 4) shows the model's decision boundary with the same 32 examples shuffled differently.
  • ...and 14 more figures
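
The Figure 3 caption reports the standard error of the mean accuracy across 5 seeds. A minimal sketch of that aggregation is shown below, assuming the usual definition (sample standard deviation divided by the square root of the number of runs). The task, the nearest-neighbor stand-in classifier, and all names are hypothetical and are not taken from the paper.

```python
import math
import random
import statistics
from typing import List, Tuple

Example = Tuple[Tuple[float, float], int]


def standard_error(values: List[float]) -> float:
    """Sample standard deviation (n - 1 denominator) divided by sqrt(n)."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.stdev(values) / math.sqrt(len(values))


def sample_task(n: int, rng: random.Random) -> List[Example]:
    """Sample n points in [0, 1]^2, labeled by the assumed rule x1 + x2 > 1."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [((x1, x2), int(x1 + x2 > 1.0)) for x1, x2 in points]


def nearest_label(context: List[Example], query: Tuple[float, float]) -> int:
    """Stand-in classifier: label of the closest in-context example."""
    closest = min(
        context,
        key=lambda e: (e[0][0] - query[0]) ** 2 + (e[0][1] - query[1]) ** 2,
    )
    return closest[1]


def accuracy_for_seed(seed: int, n_context: int = 32, n_test: int = 200) -> float:
    """Test accuracy of the stand-in classifier for one random seed."""
    rng = random.Random(seed)
    context = sample_task(n_context, rng)
    test = sample_task(n_test, rng)
    correct = sum(nearest_label(context, point) == label for point, label in test)
    return correct / n_test


if __name__ == "__main__":
    accuracies = [accuracy_for_seed(seed) for seed in range(5)]
    mean = statistics.mean(accuracies)
    sem = standard_error(accuracies)
    print(f"accuracy = {mean:.3f} +/- {sem:.3f} (SEM over {len(accuracies)} seeds)")
```

In a plot such as Figure 3, the shaded band would then span `mean - sem` to `mean + sem` at each number of in-context examples.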