DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen
TL;DR
DLO introduces a dynamic vertical scaling mechanism for transformer-based LLMs that expands, activates, or skips layers during Supervised Fine-Tuning (SFT), improving efficiency without Continual Pre-Training (CPT). It combines a group-based layer expansion strategy, similarity-guided layer activation, and a router-driven skip mechanism with similarity-induced supervision, per-layer sparsity, and annealed skip dynamics to balance accuracy and compute. Training combines the downstream task loss with a router-skip loss, and inference spends FLOPs adaptively per token, reducing cost while preserving performance. Empirical results on LLaMA2-7B show that dense DLO expansion can surpass the original model and approach dense CPT-based models in performance, while sparse DLO variants deliver strong task performance with substantially reduced FLOPs. The work offers a practical, scalable path for building efficient yet powerful LLMs and includes extensive ablations and ethical considerations for responsible deployment.
Abstract
In this paper, we introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs) by dynamically expanding, activating, or skipping layers using a routing policy based on layerwise feature similarity. Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples. Our framework is integrated with the Supervised Fine-Tuning (SFT) stage, eliminating the need for resource-intensive Continual Pre-Training (CPT). Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves results comparable to those of densely expanded models with significantly improved efficiency. Our work offers a promising direction for building efficient yet powerful LLMs. We will release our implementation and model weights upon acceptance.
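To make the idea of similarity-based layer skipping concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a token whose features are nearly unchanged by a layer (high cosine similarity between the layer's input and output) is a candidate for skipping that layer. The function names, the 0.95 threshold, and the use of plain callables in place of learned routers are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def skip_labels(layer_inputs, layer_outputs, threshold=0.95):
    """One boolean per token: True if the layer barely changed the token.

    The threshold is a hypothetical value, not taken from the paper.
    """
    return [
        cosine_similarity(x, y) >= threshold
        for x, y in zip(layer_inputs, layer_outputs)
    ]


def forward_with_skips(tokens, layers, routers):
    """Apply each layer only to tokens its router keeps active.

    `routers` are plain callables standing in for learned routers
    (an assumption). Returns the final hidden states and the number
    of per-token layer calls, a rough proxy for compute.
    """
    hidden = [list(t) for t in tokens]
    calls = 0
    for layer, router in zip(layers, routers):
        for i, h in enumerate(hidden):
            if router(h):  # True means "run this layer on this token"
                hidden[i] = layer(h)
                calls += 1
    return hidden, calls
```

In this sketch, labels from `skip_labels` could serve as supervision targets for a router, and the call count returned by `forward_with_skips` shows how compute varies per token when some layers are skipped.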
