Language Modeling by Language Models

Junyan Cheng, Peter Clark, Kyle Richardson

TL;DR

Language Modeling by Language Models introduces Genesys, a fully autonomous framework that uses LLM-driven designer and verifier agents to discover novel LM block architectures beyond transformers. Central to the approach are the Language Model Architecture Discovery Environment (LMADE) and a Ladder of Scales verification scheme, which together enable progressive, budget-aware pretraining and evaluation across model scales. In large-scale experiments, the authors report over a thousand discovered designs (1,162, of which 1,062 were fully verified through pre-training) and show that the best are competitive with established baselines. Ablations underscore the importance of verification, literature grounding, and unit-based code generation. Overall, the work demonstrates the feasibility of autonomous architecture discovery in ML and provides a structured methodology and empirical insights for future discovery systems.

Abstract

Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system, Genesys, employs a Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M$\sim$350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., an improvement of $\sim$86 percentage points in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified through pre-training) and find the best designs to be highly competitive with known architectures (e.g., outperforming GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system-level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.
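
The Ladder of Scales idea described in the abstract (verify many designs at small scale, then progressively fewer at larger scales) can be illustrated with a minimal Python sketch. This is not the Genesys implementation: the function name `ladder_of_scales`, the `score` callback, the budget values, and the toy scores are all assumptions made for illustration. Only the 14M and 350M endpoints come from the abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def ladder_of_scales(
    designs: Sequence[str],
    scales: Sequence[str],
    budgets: Sequence[int],
    score: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Verify designs at increasing scales under a narrowing budget.

    `budgets[i]` is the (hypothetical) number of models that may be trained
    at `scales[i]`. `score(design, scale)` stands in for pre-training plus
    evaluation and returns a higher-is-better number.
    """
    if len(scales) != len(budgets):
        raise ValueError("need exactly one budget per scale")

    survivors = list(designs)
    results: Dict[str, List[Tuple[str, float]]] = {}
    for scale, budget in zip(scales, budgets):
        # Only `budget` candidates are trained at this rung of the ladder.
        candidates = survivors[:budget]
        scored = sorted(
            ((d, score(d, scale)) for d in candidates),
            key=lambda pair: pair[1],
            reverse=True,
        )
        results[scale] = scored
        # Best designs come first, so the next rung keeps the top performers.
        survivors = [d for d, _ in scored]
    return results


# Toy usage with made-up scores (not results from the paper).
toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
ladder = ladder_of_scales(
    designs=["a", "b", "c", "d"],
    scales=["14M", "350M"],
    budgets=[4, 2],
    score=lambda design, scale: toy_scores[design],
)
```

In this toy run all four designs are scored at 14M and only the two highest-scoring ones advance to 350M. The actual system selects designs for verification as part of an evolutionary process, which this sketch does not attempt to model.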

Paper Structure

This paper contains 140 sections, 11 theorems, 48 equations, 62 figures, 19 tables, 6 algorithms.

Key Result

Lemma 1

Figures (62)

  • Figure 1: Can we discover novel language model architectures? A high-level illustration of our approach. Left: a discovery environment, LMADE, that provides knowledge access (Knowledge Engine) and automated evaluation (Verification Engine). Right: Genesys, an LLM-driven agent system that proposes, implements, then verifies new designs using design and verifier agents (see algorithmic workflow, far right) and feedback from LMADE.
  • Figure 2: An illustration of our reference library in LMADE -- a graph of papers on architecture design (nodes containing details of the original paper, code snippets, and other details) and citation links (edges) -- that our system queries when performing background research.
  • Figure 2: Results of evolution experiments under different configurations (%). Bold/underlined denotes the best/second.
  • Figure 3: What are we trying to discover? ① visualizes standard autoregressive LMs and the blocks that our system aims to discover (implemented via the PyTorch modules in ② and ④ with function type (X,Z) $\to$ (X,Z)). ⑤ shows an implemented block for the GPT and its factorization into a tree ③ that shows the units in that block (e.g., multi-head attention implemented in ⑥).
  • Figure 3: The error rate (%) during the design verification and evaluation stages under different system ablations.
  • ...and 57 more figures
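
The Figure 3 caption describes blocks with function type (X,Z) $\to$ (X,Z) that factor into a tree of units. The sketch below shows one way such a tree could be represented in plain Python. It is an assumption-laden illustration, not the paper's code: the `Unit` class, the rule that children are applied in sequence, the toy list-of-floats input, and the unit name `MLP` are hypothetical. The paper's units are PyTorch modules.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# (X, Z): a sequence representation plus a dict of intermediate variables.
State = Tuple[Any, Dict[str, Any]]


@dataclass
class Unit:
    """A node in a hypothetical unit tree with signature (X, Z) -> (X, Z)."""

    name: str
    fn: Optional[Callable[[Any, Dict[str, Any]], State]] = None
    children: List["Unit"] = field(default_factory=list)

    def __call__(self, x: Any, z: Dict[str, Any]) -> State:
        # Apply this unit's own function first, if it has one.
        if self.fn is not None:
            x, z = self.fn(x, z)
        # Simplifying assumption: children run one after another.
        for child in self.children:
            x, z = child(x, z)
        return x, z

    def names(self) -> List[str]:
        """Return unit names in depth-first (pre-order) order."""
        out = [self.name]
        for child in self.children:
            out.extend(child.names())
        return out


def double(x: List[float], z: Dict[str, Any]) -> State:
    # Toy stand-in for an attention unit; records a flag in Z.
    return [2 * v for v in x], {**z, "doubled": True}


def add_one(x: List[float], z: Dict[str, Any]) -> State:
    # Toy stand-in for a feed-forward unit; leaves Z unchanged.
    return [v + 1 for v in x], z


block = Unit("GPT", children=[Unit("MHA", fn=double), Unit("MLP", fn=add_one)])
x_out, z_out = block([1.0, 2.0], {})
```

Because every unit shares the same (X, Z) $\to$ (X, Z) signature, a subtree can be swapped out without changing its neighbors, which is the property a unit-based code generation scheme would rely on.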

Theorems & Definitions (13)

  • Lemma 1: Single-Shot Expected Calls
  • Lemma 2: Expected Calls: VS
  • Proposition 1: VS vs. Direct: Exponential Gain
  • Corollary 2: Identical Steps Case
  • Lemma 3: Expected Token Cost in VS
  • Lemma 4: VS Allows Strictly More Design Tokens
  • Proposition 3: Quality Gain from Viterbi-style Search Extended Tokens
  • Definition 1: Unit Tree for $\Sigma \to \Sigma$
  • Theorem 4: Unit Tree Factorization for $\Sigma \to \Sigma$
  • Definition 2: Lifted Function $\widetilde{Q}$
  • ...and 3 more