Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

A. Bochkov

TL;DR

The paper addresses the high cost and rigidity of monolithic pre-training for large language models by proposing a constructive learning paradigm that builds depth on a fixed, frozen embedding substrate. It introduces progressive layer-wise growth, where each new transformer layer is trained while prior layers remain frozen, with LoRA-based holistic fine-tuning at deeper depths under a constant trainable-parameter budget. Through a large-scale feasibility study and a controlled 9-layer comparison, the authors show that semantic capabilities emerge with depth and that frozen embeddings can outperform traditional monolithic training on reasoning benchmarks, given the right training regimen. This work suggests a path toward resource-efficient, modular AI development via a universal frozen docking substrate, with implications for continual learning, interpretability, and ecosystem-level collaboration.

Abstract

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a shift from monolithic optimization towards a more biological or constructive model of AI development, opening a path to more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
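
The freezing schedule described in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' released code and it trains nothing: it only tracks which blocks would receive gradients as layers are stacked on a frozen embedding. The names `Block`, `GrowingStack`, `simulate_growth`, and `lora_params` are hypothetical, and the per-layer count of roughly 12 * d_model^2 parameters is a common approximation for a standard Transformer block (attention plus a 4x-wide MLP, ignoring biases and norms), not a figure from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    n_params: int
    trainable: bool


@dataclass
class GrowingStack:
    """Hypothetical bookkeeping for layer-wise growth (no real training)."""
    blocks: list = field(default_factory=list)

    def grow(self, name, n_params):
        # Freeze everything already in the stack, then add one trainable block.
        for block in self.blocks:
            block.trainable = False
        self.blocks.append(Block(name, n_params, trainable=True))

    def trainable_params(self):
        return sum(b.n_params for b in self.blocks if b.trainable)

    def total_params(self):
        return sum(b.n_params for b in self.blocks)


def simulate_growth(n_layers, d_model, vocab_size):
    """Return (depth, trainable_params, total_params) after each growth step."""
    stack = GrowingStack()
    # The embedding substrate is frozen from the start and never trained.
    stack.blocks.append(Block("embedding", vocab_size * d_model, trainable=False))
    # Assumption: ~12 * d_model^2 parameters per Transformer block.
    layer_params = 12 * d_model * d_model
    history = []
    for depth in range(1, n_layers + 1):
        stack.grow(f"layer_{depth - 1}", layer_params)
        history.append((depth, stack.trainable_params(), stack.total_params()))
    return history


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a rank-r update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    for depth, trainable, total in simulate_growth(3, d_model=64, vocab_size=1000):
        print(f"depth={depth} trainable={trainable} total={total}")
```

In this sketch the trainable count stays fixed at one layer's worth of parameters while the total grows with depth, which is the sense in which strict freezing keeps each growth step cheap. The LoRA helper only shows why low-rank adapters remain small relative to a full weight matrix. How the paper schedules LoRA across the stack is not modeled here.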

Paper Structure

This paper contains 24 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Training dynamics during progressive layer-wise growth. Each loss spike marks the stacking of a new layer, followed by rapid convergence. The ARC-c metric shows a corresponding increase in capability.
  • Figure 2: Benchmark performance as a function of model depth. Note the significant jump in SQuAD score at 'n_layer=3', indicating the emergence of complex reasoning.
  • Figure 3: MMLU performance on select subjects as a function of model depth, illustrating how different reasoning capabilities strengthen as the model grows.
  • Figure 4: t-SNE visualization of frozen input token embeddings for 'abs-bvv-6'. In these frozen embeddings, semantic groups (e.g., numbers, professions, animals) are scattered randomly, without clustering. This matches the initial condition: the visually precomputed embedding layer carries no semantics.
  • Figure 5: t-SNE visualization, from layer 0 to layer 5, of attention projections (v_proj and o_proj) and transformer block outputs for 'abs-bvv-6' with a frozen input embedding layer. In contrast to the input layer, clear semantic clusters emerge and sharpen with depth, demonstrating that meaning is constructed compositionally by the network.
  • ...and 6 more figures