Emergent properties with repeated examples

François Charton, Julia Kempe

TL;DR

For a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples, and two-set training yields faster learning and better performance.

Abstract

We study the performance of transformers as a function of the number of repetitions of training examples, using algorithmically generated datasets. On three mathematical problems (the greatest common divisor, modular multiplication, and matrix eigenvalues), we show that for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples. We also demonstrate that two-set training, the repeated use of a small random subset of examples alongside normal sampling on the rest of the training set, provides faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity. These datasets and problems provide a controlled setting to shed light on the still poorly understood interplay between generalization and memorization in deep learning.
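
The two-set training idea described in the abstract can be illustrated with a small sampler. This is only a minimal sketch, not the authors' implementation: the function names `split_two_sets` and `sample_batch` are hypothetical, and it assumes that each training example is drawn from the small repeated subset with probability `p` and from the rest of the training set otherwise.

```python
import random


def split_two_sets(num_examples, small_size, seed=0):
    """Randomly split example indices into a small subset and the remainder.

    Hypothetical helper: the small subset is the one that gets repeated.
    """
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices[:small_size], indices[small_size:]


def sample_batch(small, large, p, batch_size, rng):
    """Draw one batch of example indices.

    Assumption: each example comes from the small subset with probability p,
    otherwise from the rest of the training set (uniformly in both cases).
    """
    return [
        rng.choice(small) if rng.random() < p else rng.choice(large)
        for _ in range(batch_size)
    ]


if __name__ == "__main__":
    small, large = split_two_sets(num_examples=1000, small_size=10, seed=1)
    rng = random.Random(2)
    batch = sample_batch(small, large, p=0.25, batch_size=8, rng=rng)
    print(batch)
```

With a small subset and a moderate `p`, each example in the small subset is seen far more often than any example in the remainder, which is the contrast in repetition that two-set training relies on.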

Paper Structure

This paper contains 16 sections, 1 equation, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Repetition Helps (Left): Performance as a function of repetition for a fixed training budget ($600$M). GCD (blue): models trained on smaller datasets, repeated $30$ times, perform much better than models trained for one to four epochs. Multiplication mod 67 (red): models trained for $1$ to $4$ epochs do not learn. Learning "emerges" when models are trained on smaller data budgets, with increased repetition. Two-set training (Right): For a fixed data budget, splitting the data into two random subsets and increasing the training frequency of one greatly improves performance. GCD (left): repeating $50$k examples $3000$ times, for a training budget of $600$M, raises the number of correctly predicted GCDs from $37$ to $69$ with a data budget of $100$M. Modular multiplication (right): models trained on $600$M single-use examples do not learn. With $25$M examples repeated $18$ times and $150$M single-use examples, accuracy is $92\%$; with $2.5$M examples repeated $60$ times and $450$M single-use examples, accuracy is $68\%$. Smooth distributions of repetition over the training set achieve $70\%$ accuracy.
  • Figure 2: GCD problem: (Left) GCD accuracy for different data and training budgets (average of 5 models). (Right) Test loss of models as a function of training budget, for fixed data budgets.
  • Figure 3: Two-set training for the GCD problem: Number of correctly predicted GCDs as a function of $S$ and $p$. Each measurement is the average of $6$ models. Data budget $100$M, training budget $600$M. Note the high performance for very small sets $S$ of sizes $50$, $75$, $100$, $150$ and $200$ thousand, with $p=0.25$ and $p=0.5$.
  • Figure 4: Two-set versus single-set training for the GCD problem: Number of correctly predicted GCDs as a function of training budget (up to $600$M) for data budgets of $10$M (left), $25$M (center), and $50$M (right). Two-set training with $p=0.25$ and $S=50,000$ (top $6$ curves) versus single-set training (lower $6$ curves). See Figure \ref{fig:twosamples_25M2} in Appendix \ref{app:add_exp} for an extended training budget with a data budget of $50$M.
  • Figure 5: Two-set training for Modular Multiplication: Accuracy as a function of small set size $S$ and $p$, each averaged over $6$ models. Data budget $100$M (left) and unlimited (right), training budget $600$M. Note: the bottom right of the left graph corresponds to single-set $10$M-models: for $p=0.1$ and $S=10$M, the small and large sets are selected with the same probability.
  • ...and 4 more figures
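
The budget figures quoted in the Figure 1 caption can be checked with a few lines of arithmetic. The sketch below is illustrative only: `repeated_share` is a hypothetical helper, and it assumes that the number of training examples spent on the repeated subset equals its size times its number of repetitions.

```python
TRAINING_BUDGET = 600_000_000  # training budget from the Figure 1 caption


def repeated_share(small_size, repetitions, training_budget=TRAINING_BUDGET):
    """Fraction of the training budget spent on the repeated subset.

    Assumption: examples seen from the subset = subset size * repetitions.
    """
    return small_size * repetitions / training_budget


# (subset size, repetitions) pairs quoted in the Figure 1 caption
settings = {
    "GCD: 50k examples x 3000": (50_000, 3000),
    "Modular multiplication: 25M examples x 18": (25_000_000, 18),
    "Modular multiplication: 2.5M examples x 60": (2_500_000, 60),
}

for name, (size, reps) in settings.items():
    share = repeated_share(size, reps)
    remainder = TRAINING_BUDGET - size * reps
    print(f"{name}: share={share:.2f}, remaining budget={remainder:,}")
```

Under this assumption, the two modular multiplication settings leave $150$M and $450$M examples for the rest of the training set, matching the single-use counts in the caption, and the GCD setting spends a quarter of the $600$M budget on the repeated subset.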