Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

A. Bochkov

TL;DR

The paper addresses the high cost and rigidity of monolithic pre-training for large language models by proposing a constructive learning paradigm that builds depth on a fixed, frozen embedding substrate. It introduces progressive layer-wise growth, where each new transformer layer is trained while prior layers remain frozen, with LoRA-based holistic fine-tuning at deeper depths under a constant trainable-parameter budget. Through a large-scale feasibility study and a controlled 9-layer comparison, the authors show that semantic capabilities emerge with depth and that frozen embeddings can outperform traditional monolithic training on reasoning benchmarks, given the right training regimen. This work suggests a path toward resource-efficient, modular AI development via a universal frozen docking substrate, with implications for continual learning, interpretability, and ecosystem-level collaboration.

Abstract

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This frees backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a shift from monolithic optimization toward a more biological, constructive model of AI development, opening a path to more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
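The freezing schedule described in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' implementation: it only tracks which parameter groups would be trainable at each growth stage, and how many parameters a LoRA adapter adds. The class and function names (`ParamGroup`, `GrownModel`, `grow`, `block_params`, `lora_params`), the sizes used in the demo, and the rough estimate of 12·d² weights per Transformer block (four d×d attention projections plus a 4d-wide MLP, ignoring biases and norms) are all assumptions made for illustration; the paper's actual configuration is not given in this section.

```python
from dataclasses import dataclass, field


def block_params(d_model: int) -> int:
    """Rough weight count of one Transformer block (assumed shape):
    4*d^2 for the Q, K, V, O projections plus 8*d^2 for a 4d-wide MLP."""
    return 12 * d_model * d_model


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights LoRA adds to one d_out x d_in matrix:
    a rank x d_in factor plus a d_out x rank factor."""
    return rank * (d_in + d_out)


@dataclass
class ParamGroup:
    name: str
    n_params: int
    trainable: bool


@dataclass
class GrownModel:
    d_model: int
    vocab_size: int
    modules: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # The embedding substrate is frozen from the start and never trained.
        self.modules.append(
            ParamGroup("embedding", self.vocab_size * self.d_model, trainable=False)
        )

    def grow(self) -> None:
        """Freeze everything built so far, then append one new trainable block."""
        for group in self.modules:
            group.trainable = False
        depth = len(self.modules) - 1  # blocks already present
        self.modules.append(
            ParamGroup(f"block{depth}", block_params(self.d_model), trainable=True)
        )

    def trainable_params(self) -> int:
        return sum(g.n_params for g in self.modules if g.trainable)

    def total_params(self) -> int:
        return sum(g.n_params for g in self.modules)


# Hypothetical sizes, chosen only to keep the numbers small.
demo = GrownModel(d_model=64, vocab_size=1000)
for _ in range(3):
    demo.grow()
    print(len(demo.modules) - 1, demo.trainable_params(), demo.total_params())
```

In this sketch the trainable count stays at one block's worth of weights while the total grows with depth, which is the property that lets backpropagation focus on the newly added component. `lora_params` shows why holistic fine-tuning stays cheap at greater depth: the adapter cost for each matrix scales with the rank rather than with the full matrix size.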
