Continually self-improving AI

Zitong Yang

Abstract

Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model's weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.

Paper Structure (249 sections, 4 theorems, 87 equations, 29 figures, 29 tables, 2 algorithms)

Introduction
Defining continually self-improving AI
Continual knowledge acquisition
Bootstrapping pretraining capabilities
Towards AI-designed AI via test-time search
Publications
Related work
Continual knowledge acquisition
Synthetic generation of pretraining data
Continued pretraining
Knowledge editing
Synthetic data generation
Continual learning and pretraining
Bootstrapping pretraining capabilities
LM pretraining
...and 234 more sections

Key Result

Theorem 1

For any time $t \geq 1$ and any $\varepsilon>0$, the link density satisfies, with probability $\to 1$,

Figures (29)

Figure 1: Synthetic continued pretraining (synthetic CPT) converts a small source corpus into a large synthetic corpus that is amenable to learning via standard continued pretraining. We instantiate synthetic CPT using a synthetic data augmentation algorithm called EntiGraph, which forms a knowledge graph over entities extracted from documents, and then prompts an LM to synthesize a text-based representation of the graph.
Figure 2: Accuracy on the QuALITY question set $\mathcal{Q}_{\text{test}}$ ($y$-axis) as a function of the synthetic token count ($x$-axis). The accuracy of synthetic continued pretraining using the EntiGraph data augmentation algorithm (EntiGraph CPT) scales log-linearly up to 455M tokens.
Figure 3: Closed-book summarization: number of false claims ($y$-axis) versus number of salient claims ($x$-axis) normalized by the human summary.
Figure 4: The scaling properties of Synthetic CPT with the EntiGraph and Rephrase augmentations, comparing two synthetic data generators: GPT-4-Turbo and Llama 3.1 8B Instruct.
Figure 5: The scaling properties of Synthetic CPT using the EntiGraph augmentation on the Coursera Exam QA dataset.
...and 24 more figures
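
The caption of Figure 1 describes EntiGraph as forming a knowledge graph over entities extracted from documents and then prompting an LM to synthesize text about that graph. The following is a minimal sketch of that idea, not the thesis's implementation: the capitalized-word heuristic in `extract_entities_stub`, the prompt wording, and both function names are assumptions made for illustration, and no LM is actually called.

```python
from itertools import combinations


def extract_entities_stub(document: str) -> list[str]:
    """Hypothetical stand-in for entity extraction.

    Assumption: capitalized words are treated as entities. The caption
    only says entities are extracted from documents; it does not say how.
    """
    entities: list[str] = []
    for token in document.split():
        word = token.strip(".,;:!?\"'()")
        if word[:1].isupper() and word not in entities:
            entities.append(word)
    return entities


def relation_prompts(document: str, entities: list[str]) -> list[str]:
    """Build one prompt per entity pair (one candidate edge of the graph).

    The prompt template is an assumption, not the wording used in the thesis.
    """
    prompts = []
    for a, b in combinations(entities, 2):
        prompts.append(
            f"Based on the document below, discuss how {a} relates to {b}.\n\n"
            f"{document}"
        )
    return prompts


if __name__ == "__main__":
    doc = "Alice met Bob in Paris."
    ents = extract_entities_stub(doc)
    for prompt in relation_prompts(doc, ents):
        # Print only the instruction line of each prompt.
        print(prompt.splitlines()[0])
```

Each prompt would then be sent to a generator LM, and the responses collected into the synthetic corpus used for continued pretraining. The number of entity pairs grows quadratically with the number of entities, which is one way a small source corpus can yield a much larger synthetic one.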

Theorems & Definitions (8)

Definition 1: Continually self-improving AI
Definition 2
Theorem 1
Proof: Proof of Theorem \ref{thm:toy}
Lemma 1.E.1: Lemma 1 and Corollary 1 in karp1990transitive
Lemma 1.E.2: Theorem 3 in karp1990transitive and Theorem 2.4.1 in durrett2010random
Lemma 1.E.3
Proof: Proof of Lemma \ref{lem:shape}
