Continually self-improving AI

Zitong Yang

Abstract

Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model's weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.

Paper Structure (249 sections, 4 theorems, 87 equations, 29 figures, 29 tables, 2 algorithms)

Key Result

Theorem 1

For any time $t \geq 1$ and any $\varepsilon>0$, the link density satisfies, with probability $\to 1$,

Figures (29)

  • Figure 1: Synthetic continued pretraining (synthetic CPT) converts a small source corpus into a large synthetic corpus that is amenable to learning via standard continued pretraining. We instantiate synthetic CPT using a synthetic data augmentation algorithm called EntiGraph, which forms a knowledge graph over entities extracted from documents, and then prompts an LM to synthesize a text-based representation of the graph.
  • Figure 2: Accuracy on the QuALITY question set $\mathcal{Q}_{\text{test}}$ ($y$-axis) as a function of the synthetic token count ($x$-axis). The accuracy of synthetic continued pretraining using the EntiGraph data augmentation algorithm (EntiGraph CPT) scales log-linearly up to 455M tokens.
  • Figure 3: Closed-book summarization: number of false claims ($y$-axis) versus number of salient claims ($x$-axis) normalized by the human summary.
  • Figure 4: The scaling properties of Synthetic CPT with the EntiGraph and Rephrase augmentations, comparing two synthetic data generators: GPT-4-Turbo and Llama 3.1 8B Instruct.
  • Figure 5: The scaling properties of Synthetic CPT using the EntiGraph augmentation on the Coursera Exam QA dataset.
  • ...and 24 more figures
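
The Figure 1 caption describes EntiGraph only at a high level: extract entities from a document, treat pairs of entities as edges of a knowledge graph, and prompt an LM to write text about each edge. A minimal sketch of that idea follows. It is not the thesis's implementation: the function name `entigraph_sketch`, the prompt wording, the `max_entities` cap, and the offline stand-in `toy_lm` are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List


def entigraph_sketch(document: str, lm: Callable[[str], str],
                     max_entities: int = 5) -> List[str]:
    """Hypothetical two-step augmentation: extract entities, then describe each pair."""
    # Step 1: ask the LM for entities, one per line (prompt wording is an assumption).
    raw = lm(f"List the salient entities in the text, one per line.\n\n{document}")
    entities: List[str] = []
    for line in raw.splitlines():
        name = line.strip()
        if name and name not in entities:  # drop blanks and duplicates
            entities.append(name)
    entities = entities[:max_entities]

    # Step 2: each entity pair is one edge of the graph; ask the LM to
    # write a text passage about that edge, grounded in the document.
    synthetic: List[str] = []
    for a, b in combinations(entities, 2):
        prompt = (f"Using only the text below, discuss how {a} relates to {b}."
                  f"\n\n{document}")
        synthetic.append(lm(prompt))
    return synthetic


def toy_lm(prompt: str) -> str:
    """Stand-in for a real LM so the sketch runs offline (assumption)."""
    if prompt.startswith("List the salient entities"):
        return "Alice\nBob\nParis"
    # Echo the first prompt line so the output shows which edge was requested.
    return "SYNTH: " + prompt.splitlines()[0]


if __name__ == "__main__":
    for passage in entigraph_sketch("Alice met Bob in Paris.", toy_lm):
        print(passage)
```

With n extracted entities the loop issues n(n-1)/2 pair prompts, which suggests how a small source corpus could expand into a much larger synthetic one. A real run would replace `toy_lm` with a call to an actual language model.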

Theorems & Definitions (8)

  • Definition 1: Continually self-improving AI
  • Definition 2
  • Theorem 1
  • Proof: Proof of Theorem \ref{thm:toy}
  • Lemma 1.E.1: Lemma 1 and Corollary 1 in karp1990transitive
  • Lemma 1.E.2: Theorem 3 in karp1990transitive and Theorem 2.4.1 in durrett2010random
  • Lemma 1.E.3
  • Proof: Proof of Lemma \ref{lem:shape}