The representation landscape of few-shot learning and fine-tuning in large language models

Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga

TL;DR

This work compares how LLMs solve the same question-answering task under in-context learning (ICL) and supervised fine-tuning (SFT), finding that the two strategies create very different internal structures, in both cases undergoing a sharp transition in the middle of the network.

Abstract

In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.

Paper Structure (59 sections, 1 equation, 31 figures, 2 tables)

Figures (31)

  • Figure 1: The LLM representation landscape of few-shot learning and fine-tuning. This figure illustrates the distribution of probability modes in large language models (LLMs) during a question-answering task (MMLU). The top row shows representations from layers near the input, while the bottom row shows those near the output. We compare three scenarios: zero-shot (0-shot, left), in-context learning (5-shot, center), and fine-tuning (right). In the 5-shot scenario, early layers (top, center) develop better representations of the dataset's subjects. Conversely, in the fine-tuned model, the late layers (bottom, right) more accurately reflect the letter answers.
  • Figure 2: Intrinsic dimension, number of density peaks, and fraction of core points. The figure shows the intrinsic dimension (ID, left), the number of density peaks (center), and the fraction of core points (right) for the last-token representation of Llama3-8b, for an increasing number of few-shot examples and for fine-tuned models. All three quantities change in a two-phase fashion around layer 17.
  • Figure 3: Adjusted Rand Index (ARI) between clusters and subjects. ARI between clusters and subjects for Llama3-8b (left), Llama3-70b (center), and Mistral-7b (right), for an increasing number of few-shot examples and for fine-tuned representations. In all cases, the match between the cluster partition and the subject partition is highest in the early layers of the network and grows with the number of shots.
  • Figure 4: Density peaks in the layers that best encode the subjects in Llama3-8b. The dendrograms show the organization of the density peaks in Llama3-8b in the layers where the ARI with the subjects is highest, for the 5-shot setup (top), the 0-shot setup (bottom left), and the fine-tuned model (bottom right). In the 5-shot setup, the clusters are populated by examples from one or two related subjects, and their similarity reflects the semantic relationships between the subjects. In the 0-shot and fine-tuned representations (bottom panels), some large clusters contain many subjects.
  • Figure 5: Adjusted Rand Index between clusters and final answers. Adjusted Rand Index (ARI) between clusters and the MMLU answers (test set) for Llama3-8b (left), Llama3-70b (center), and Mistral-7b (right). In the second part of the network, the purity of the clusters with respect to the answer partition is highest for fine-tuned models.
  • ...and 26 more figures
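
Several of the captions above report the Adjusted Rand Index (ARI) between a cluster partition and a ground-truth partition (subjects or letter answers). As a reading aid, the standard ARI formula can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: the function name `adjusted_rand_index` and the toy labels are assumptions made for the example, and the density-peak clustering that produces the cluster labels in the paper is not reproduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI between two labelings of the same points (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical

    # Contingency counts: how many points share each (label_a, label_b) combination.
    joint = Counter(zip(labels_a, labels_b))
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)

    # Number of point pairs grouped together in both, in A only, in B only.
    pairs_joint = sum(comb(c, 2) for c in joint.values())
    pairs_a = sum(comb(c, 2) for c in counts_a.values())
    pairs_b = sum(comb(c, 2) for c in counts_b.values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (pairs_joint - expected) / (maximum - expected)


# Toy labels (assumed for illustration): cluster ids versus subject names.
clusters = [0, 0, 1, 1]
subjects = ["physics", "physics", "law", "law"]
ari_perfect = adjusted_rand_index(clusters, subjects)

# A cluster partition that splits every subject evenly across clusters.
ari_mixed = adjusted_rand_index([0, 1, 0, 1], subjects)
```

An ARI of 1 means the clusters reproduce the reference partition exactly (label names do not matter), values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance, as in the second toy example.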