Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference

Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

TL;DR

This work addresses the high computation costs of deploying large language models by enabling dynamic inference through Sorted Fine-Tuning (SoFT), extending the SortedNet approach to generative LLMs without pre-training. It forms eight depth-based sub-models within LLaMA 2 13B by sharing a single output head and training on Alpaca and TriviaQA, demonstrating that intermediate layers can generate meaningful outputs and approach full-model capability. The study shows SoFT consistently outperforms Standard Fine-Tuning and SFT+ICT across instruction-following and QA tasks, and enables speedups via speculative decoding and instance-aware dynamic inference with minimal storage overhead. The results indicate a practical pathway to deploy efficient, dynamic LLMs in real-time settings, reducing computational budgets while maintaining performance.

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) by excelling at understanding and generating human-like text. However, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference by leveraging the modularity in networks and sorting sub-models based on computation/accuracy in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any Pre-Training and by only replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that this approach can unlock the power of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. The efficacy of our proposed method was demonstrated by applying it to tune LLaMA 2 13B on the Stanford Alpaca dataset for instruction following and TriviaQA for closed-book question answering. Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit), all achieved with efficient tuning and without additional memory usage during inference.
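The core idea of sharing one output head across nested, depth-based sub-models can be illustrated with a toy sketch. This is not the authors' implementation: the tanh "layers" stand in for transformer blocks, the exit depths (12 to 40 in steps of 4) are an assumed spacing for eight sub-models of a 40-layer model, and averaging the per-depth losses is one plausible way to combine them.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Mean token-level cross-entropy; logits is (T, V), targets is (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def sorted_loss(hidden, layers, head, targets, exit_depths):
    """Average the loss over nested sub-models that all share one output head."""
    losses = {}
    for depth, weight in enumerate(layers, start=1):
        hidden = np.tanh(hidden @ weight)  # toy stand-in for a transformer block
        if depth in exit_depths:
            logits = hidden @ head         # the same head is reused at every exit
            losses[depth] = softmax_cross_entropy(logits, targets)
    total = sum(losses.values()) / len(losses)  # assumption: simple average
    return total, losses

# Hypothetical toy dimensions; only the 40-layer depth mirrors the paper's model.
rng = np.random.default_rng(0)
num_layers, dim, vocab, seq_len = 40, 16, 32, 5
layers = [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(num_layers)]
head = rng.normal(scale=dim ** -0.5, size=(dim, vocab))
hidden = rng.normal(size=(seq_len, dim))
targets = rng.integers(0, vocab, size=seq_len)
exit_depths = (12, 16, 20, 24, 28, 32, 36, 40)  # assumed spacing of 8 sub-models

total, per_depth = sorted_loss(hidden, layers, head, targets, exit_depths)
```

Because every sub-model is a prefix of the full stack and reuses the same head, a single forward pass yields all eight losses, and no extra parameters need to be stored per sub-model.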

Paper Structure (28 sections, 2 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: SoFT vs. SFT + ICT (Early-Exit) (Left) and SoFT vs. SFT (Right). In SoFT, the output prediction layer is shared between all sub-models, whereas Early-Exit learns a separate prediction head per sub-model, making inference inefficient. Both SoFT and SFT had equivalent training time (2 epochs) in this experiment. The number in each cell is calculated by counting wins as the times SoFT sub-models (rows) were preferred, losses as the times SFT sub-models (columns) were preferred, and ties as the times neither was preferred (Equation \ref{eq:score}). Algorithm performance is correlated with cell whiteness: white is better, zero is on par, dark is worse.
  • Figure 2: SoFT vs. Extracted Fine-Tuning. The left figure shows an equal training time setup (2 epochs), and the figure on the right considers two extra training epochs for SoFT.
  • Figure 3: TriviaQA results. We report case-sensitive exact-match accuracy as the main metric. SFT+ICT and Extracted Fine-Tuning results are reported at Epoch 2, because the Epoch 2 checkpoint had saturated in the original SFT experiment (the main LLaMA 2 13B model with 40 layers).
  • Figure 4: An inter-model comparison of sub-models based on output logits and hidden state cosine similarity. The numbers are averages over all 170 samples in the PandaLM validation set. The similarity is stronger if the cell is darker.
  • Figure 5: An intra-model comparison of sub-models based on output logits and hidden state cosine similarity. The similarity is stronger if the cell is darker.
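
The cosine-similarity comparisons in Figures 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the paper's analysis code: the function names are hypothetical, and each sub-model's hidden state (or output logits) is assumed to be already flattened into a single vector per sample.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (flattened first)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_matrix(states):
    """Pairwise cosine similarity between the vectors of several sub-models."""
    n = len(states)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = cosine_similarity(states[i], states[j])
    return sim

# Hypothetical hidden states for three sub-models on one sample.
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 0.0])]
sim = similarity_matrix(states)
```

Averaging such matrices over every sample in an evaluation set would give one heat map per comparison, with larger values indicating that two sub-models produce more closely aligned representations.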