DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs

Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen

TL;DR

DLO introduces a dynamic vertical scaling mechanism for transformer-based LLMs that expands, activates, or skips layers during Supervised Fine-Tuning (SFT), improving efficiency without Continual Pre-Training (CPT). It employs a group-based layer expansion strategy, similarity-guided layer activation, and a router-driven skip mechanism with similarity-induced supervision, per-layer sparsity, and annealed skip dynamics to balance accuracy and compute. Training combines the downstream task loss with a router-skip loss, and inference spends an adaptive number of FLOPs per token, reducing cost while preserving performance. Empirical results on LLaMA2-7B show that dense DLO expansion can surpass the original model and approach dense CPT-based models in performance, while sparse DLO variants deliver strong task performance with substantially fewer FLOPs. The work offers a practical, scalable path for building efficient yet powerful LLMs and includes extensive ablations and ethical considerations for responsible deployment.

Abstract

In this paper, we introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs) by dynamically expanding, activating, or skipping layers using a sophisticated routing policy based on layerwise feature similarity. Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples. Our framework is integrated with the Supervised Fine-Tuning (SFT) stage, eliminating the need for resource-intensive Continual Pre-Training (CPT). Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency. Our work offers a promising direction for building efficient yet powerful LLMs. We will release our implementation and model weights upon acceptance.

Paper Structure

This paper contains 32 sections, 8 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) DLO structure, which resembles human brain activity in a math problem example (Koechlin et al., 2003): primary neurons perceive numbers, secondary neurons understand operations, and high-order neurons calculate the results. (b) Layer-wise token similarity and distribution.
  • Figure 2: Layer extension with initialization strategies.
  • Figure 3: Training pipeline of DLO, consisting of the downstream task loss and an auxiliary router-skip loss supervised by generated router labels.
  • Figure 4: Visualization on different datasets of (a) Layer-Wise Number of Activations, (b) Layer-Wise Average Similarity, and (c) Token Activation Examples.
  • Figure 5: (a) Performance vs. training time. LLaMA-Pro is reported in H800 GPU hours, as quoted from the original paper; the rest are reported in A100 GPU hours. (b) Performance vs. inference FLOPs. DLO achieves the best trade-off between performance and training or inference cost.
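
The training objective summarized in the TL;DR and in Figure 3 (a downstream task loss plus an auxiliary router-skip loss supervised by similarity-derived router labels) can be illustrated with a small sketch. This is not the authors' implementation. The function names, the `keep_ratio` and `skip_weight` parameters, the binary cross-entropy form of the skip loss, and the rule that tokens with the lowest input/output cosine similarity stay active are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def router_labels(layer_inputs, layer_outputs, keep_ratio):
    """Hypothetical labeling rule for one layer: 1.0 = run the layer, 0.0 = skip.

    Assumption: a token whose hidden state barely changes across the layer
    (high input/output similarity) is treated as redundant and skipped.
    keep_ratio is an assumed per-layer fraction of tokens kept active.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(layer_inputs, layer_outputs)]
    n_keep = max(1, round(keep_ratio * len(sims)))
    # Sort token indices by similarity, lowest first; those stay active.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    active = set(order[:n_keep])
    return [1.0 if i in active else 0.0 for i in range(len(sims))]

def router_skip_loss(router_probs, labels):
    """Mean binary cross-entropy between router probabilities and labels (assumed form)."""
    eps = 1e-8
    terms = [
        -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
        for p, y in zip(router_probs, labels)
    ]
    return sum(terms) / len(terms)

def total_loss(task_loss, skip_loss, skip_weight=0.1):
    """Task loss plus weighted router-skip loss; skip_weight is a made-up value."""
    return task_loss + skip_weight * skip_loss

# Toy example: four tokens with 2-dimensional hidden states.
inputs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
labels = router_labels(inputs, outputs, keep_ratio=0.5)
loss = total_loss(task_loss=2.0, skip_loss=router_skip_loss([0.2, 0.9, 0.8, 0.1], labels))
```

In this toy example the second and third tokens change the most across the layer, so they are the two labeled active. A real implementation would operate on batched tensors and would also need the annealed skip schedule mentioned in the TL;DR, which this sketch omits.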