The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre

TL;DR

Hyperfitting reveals a counter-intuitive regime in which overfitting a pre-trained LLM on a tiny dataset to near-zero training loss dramatically sharpens long-sequence generation under greedy decoding, often surpassing larger models and Top-P baselines in human preference. The approach, validated across multiple models and even extending to autoregressive image generation, yields low-entropy, highly peaked next-token distributions and a stronger top-ranked token bias, yet can exhibit high perplexity on held-out data. The work introduces the Top-Rank Encouragement hypothesis to explain why low training loss improves token ranking without necessarily reducing perplexity, and shows that diversity remains adequate with limited data and even with a simple citation blocker. Collectively, these findings highlight a new, data-efficient generalization regime with practical implications for open-ended generation, evaluation, and cross-modal models, while posing important questions about data dependence and ranking dynamics.

Abstract

This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent under greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that further fine-tuning these models to a near-zero training loss on a small set of samples (a process we refer to as hyperfitting) greatly enhances their long-sequence generative capabilities. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
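The two quantities the abstract and TL;DR contrast, the entropy of a next-token distribution and perplexity on held-out data, can be illustrated with a small sketch. All numbers below are hypothetical and chosen only for illustration; they are not results from the paper. The sketch shows how a sharply peaked predictor can pick the correct top-ranked token more often than a flatter one and still have higher perplexity, because a single confident miss dominates the average log-likelihood.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(true_token_probs):
    """Perplexity: exp of the mean negative log-likelihood of the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical next-token distributions over a 4-token vocabulary.
flat = [0.4, 0.3, 0.2, 0.1]              # spread-out prediction
peaked = [0.997, 0.001, 0.001, 0.001]    # nearly all mass on one token

# Hypothetical probabilities assigned to the TRUE token at 5 held-out positions.
# Flat predictor: true token is top-ranked (p=0.4) at 3 positions, second (p=0.3) at 2.
flat_true = [0.4, 0.4, 0.4, 0.3, 0.3]
# Peaked predictor: true token is top-ranked at 4 positions, but the one miss
# receives almost no probability.
peaked_true = [0.997, 0.997, 0.997, 0.997, 0.001]

print(f"entropy    flat={entropy(flat):.3f}  peaked={entropy(peaked):.3f}")
print(f"perplexity flat={perplexity(flat_true):.3f}  peaked={perplexity(peaked_true):.3f}")
```

In this toy setting the peaked predictor has far lower entropy and better top-1 accuracy (4 of 5 versus 3 of 5), yet its perplexity is higher than that of the flat predictor.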

Paper Structure

This paper contains 28 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Example of greedy decoding using Llama 3.1 and its hyperfitted counterpart. Color indicates how repetitive the generated text is.
  • Figure 2: Training and validation loss for TinyLlama during overfitting, along with the resulting mean type-token ratio (TTR) when greedily generating 96 tokens from contexts in the validation data.
  • Figure 3: Distribution of the longest overlap between 1000 generated texts and the dataset.
  • Figure 4: A subsequence from the validation data and the corresponding top-3 predictions. The words "Coverage", "Manchester", and "United" never appear in the hyperfitting dataset.
  • Figure 5: Left: Top-1 rank similarity matrix of Llama 3.1 (8B) hyperfitted on identical, but shuffled, data. Right: The resulting mean TTR of 300 generated texts as the number of training samples varies.
  • ...and 6 more figures
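
Several of the captions above report TTR (type-token ratio) as a diversity measure. A minimal sketch of that metric follows, assuming the standard definition (unique tokens divided by total tokens) and whitespace tokenization for readability. The function name and example strings are hypothetical; the paper's own computation presumably operates on model tokens.

```python
def type_token_ratio(tokens):
    """Fraction of tokens that are unique: len(set(tokens)) / len(tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical generations, split on whitespace for illustration only.
repetitive = "the cat sat on the mat and the cat sat on the mat".split()
varied = "a quick brown fox jumps over one lazy dog near the river".split()

print(f"repetitive TTR = {type_token_ratio(repetitive):.3f}")
print(f"varied TTR     = {type_token_ratio(varied):.3f}")
```

A looping generation reuses the same tokens and so scores a low TTR, while a text with no repeated tokens scores 1.0.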