Training Language Models via Neural Cellular Automata

Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal

TL;DR

This work proposes using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data first and natural language second. It finds that attention layers are the most transferable and that the optimal NCA complexity varies by domain.

Abstract

Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data first and natural language second. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl, which uses more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
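To make the idea of "NCA data" concrete, here is a minimal sketch of how a cellular automaton with a small neural update rule could be rolled out and flattened into a token sequence for next-token prediction. Everything specific in it is an assumption made for illustration and is not taken from the paper: the 3x3 neighborhood, the random two-layer MLP rule, the argmax update, the wrap-around borders, the row-major flattening, and the names `make_rule`, `step`, and `rollout_tokens`.

```python
import random

def make_rule(n_states, hidden, rng):
    """Random two-layer MLP: 3x3 one-hot neighborhood -> logits over states.
    (Hypothetical rule family; the paper's NCA architecture may differ.)"""
    n_in = 9 * n_states
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_states)]
    return w1, w2

def step(grid, rule, n_states):
    """Update every cell synchronously, with wrap-around (toroidal) borders."""
    w1, w2 = rule
    size = len(grid)
    new = [[0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            # Positions of the active entries in the one-hot neighborhood input.
            active = []
            k = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    s = grid[(r + dr) % size][(c + dc) % size]
                    active.append(k * n_states + s)
                    k += 1
            hidden = [max(0.0, sum(row[i] for i in active)) for row in w1]  # ReLU
            logits = [sum(w * x for w, x in zip(row, hidden)) for row in w2]
            # Assumed deterministic update: pick the highest-scoring state.
            new[r][c] = max(range(n_states), key=logits.__getitem__)
    return new

def rollout_tokens(size=6, n_states=4, steps=5, hidden=8, seed=0):
    """Roll out one randomly sampled rule and flatten the frames to tokens."""
    rng = random.Random(seed)
    rule = make_rule(n_states, hidden, rng)
    grid = [[rng.randrange(n_states) for _ in range(size)] for _ in range(size)]
    tokens = []
    for _ in range(steps):
        tokens.extend(s for row in grid for s in row)  # row-major flattening
        grid = step(grid, rule, n_states)
    return tokens

if __name__ == "__main__":
    toks = rollout_tokens()
    print(len(toks), toks[:12])
```

Each seed samples a different rule, so a corpus would be built by concatenating rollouts over many seeds. Controlling the complexity of the sampled dynamics, which the abstract identifies as domain-dependent, is not modeled in this sketch.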

Paper Structure (36 sections, 2 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: Overview of NCA Pre-pre-training to Language Pre-training. We pre-pre-train a transformer with next-token prediction on the dynamics of neural cellular automata (NCA) sampled from selected complexity regions. We then conduct standard pre-training on natural language corpora. NCA pre-pre-training improves both validation perplexity and convergence speed on language pre-training. Interestingly, the optimal NCA distribution varies by downstream domain.
  • Figure 2: NCA pre-pre-training improves and accelerates language model pre-training across diverse domains. We show the validation perplexity during pre-training on (a) OpenWebText, (b) OpenWebMath, and (c) CodeParrot for 1.6B parameter models. Models pre-pre-trained on NCA trajectories consistently outperform the scratch, Dyck pre-pre-training, and surprisingly even C4 pre-pre-training baselines. NCA pre-pre-training achieves 1.4--1.6$\times$ faster convergence to the scratch baseline's final perplexity while also reaching up to 6% lower final perplexity. We provide a zoomed-in training curve of the last third of training for clarity.
  • Figure 3: NCA pre-pre-training improves language model training performance across model sizes (Section \ref{subsec:nca_transfer}). We report the final validation perplexity after pre-training on OpenWebText for 400M, 600M, and 1.6B parameter models. At 164M tokens, C4 pre-pre-training likely acquires shallow syntactic patterns that interfere with downstream learning rather than transferable structure. We investigate this further in Figure \ref{fig:c4-1.6B}.
  • Figure 4: Pre-pre-training on 160M tokens of NCA is better than pre-pre-training on 1.6B tokens of natural language (C4). We report the validation perplexity during pre-training on OpenWebText. Perplexity improvement is calculated relative to the C4 pre-pre-trained model. We also add a variant that preserves the embedding layers from pre-pre-training to pre-training (1.6B tokens w/o embedding reinit). Surprisingly, NCA pre-pre-training is better even than this variant.
  • Figure 5: Attention weights are most crucial for positive transfer. We report the change in validation perplexity when selectively re-initializing model components after NCA pre-pre-training, relative to full transfer. Higher means the component is more important for transfer. Re-initializing attention causes the largest degradation across both OpenWebText and CodeParrot, while MLP and LayerNorm effects are domain-dependent.
  • ...and 4 more figures
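
The captions above report two summary numbers: a convergence speedup "to the scratch baseline's final perplexity" and a relative reduction in final perplexity. The exact definitions the authors use are not given in this excerpt, so the following is only one plausible reading, sketched in Python. The helper names `convergence_speedup` and `relative_improvement` and the toy curves are hypothetical, and the numbers are made up for illustration; they are not results from the paper.

```python
def convergence_speedup(baseline_curve, model_curve):
    """Baseline training length divided by the first model step whose
    validation perplexity is at or below the baseline's final perplexity.

    Curves are lists of (step, perplexity) pairs sorted by step.
    Returns None if the model never reaches the baseline's final value.
    """
    final_step, target = baseline_curve[-1]
    for step, ppl in model_curve:
        if ppl <= target:
            return final_step / step
    return None

def relative_improvement(baseline_ppl, model_ppl):
    """Fractional perplexity reduction relative to the baseline."""
    return (baseline_ppl - model_ppl) / baseline_ppl

if __name__ == "__main__":
    # Toy curves, invented for illustration only.
    scratch = [(1000, 40.0), (2000, 30.0), (3000, 25.0), (4000, 22.0)]
    pre_pre_trained = [(1000, 36.0), (2000, 26.0), (2500, 22.0), (4000, 20.9)]

    print("speedup:", convergence_speedup(scratch, pre_pre_trained))
    print("improvement:", relative_improvement(scratch[-1][1], pre_pre_trained[-1][1]))
```

With evaluation checkpoints this coarse, the speedup is only resolved to the nearest checkpoint. Interpolating between checkpoints would give a smoother estimate, at the cost of another assumption.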