
Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection

Tianxiang Chen, Zhentao Tan, Tao Gong, Yue Wu, Qi Chu, Bin Liu, Jieping Ye, Nenghai Yu

TL;DR

This paper proposes the S strategy, a post-pretraining strategy that selectively enhances shallow layers while pruning the less effective deep ones, and introduces Llama Slayer-8B and Llama Slayer-8B-Instruct.

Abstract

Knowledge injection, a way to augment pre-trained large language models (LLMs), is critical to developing vertical-domain large models and has been widely studied. Most current approaches, including parameter-efficient fine-tuning (PEFT) and block expansion methods, inject knowledge uniformly across all LLM layers, which raises the question: are all layers equally crucial for knowledge injection? We begin by evaluating the importance of each layer to find the optimal layer range for knowledge injection. Intuitively, the more important layers should play a more critical role in knowledge injection and deserve denser injection. We observe performance dips on question-answering benchmarks after removing or expanding shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy that selectively enhances shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct. Experiments on a code and math corpus demonstrate the effectiveness of our strategy. Further experiments on a different LLM, Mistral-7B, and on a legal corpus confirm the general applicability of the approach. Our code is available at: https://github.com/txchen-USTC/Llama-Slayer
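
The prune-then-expand idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: each block is represented by a flat list of floats, and the function names (`interpolate`, `prune_and_expand`), the choice of which deep blocks to drop, and the placement of new blocks are all assumptions made for illustration. New blocks are initialized by linear interpolation of their neighbours and are the only ones flagged as trainable.

```python
from typing import List, Tuple

Block = List[float]  # hypothetical stand-in for one block's flattened weights


def interpolate(a: Block, b: Block, t: float = 0.5) -> Block:
    """Linear interpolation between two blocks' weights."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def prune_and_expand(
    blocks: List[Block], num_pruned: int, num_expanded: int
) -> Tuple[List[Block], List[bool]]:
    """Drop deep blocks, then insert interpolated blocks among the shallow ones.

    Returns the new block list and a parallel list of trainable flags.
    """
    if num_pruned < 0 or num_expanded < 0:
        raise ValueError("counts must be non-negative")
    if num_pruned > len(blocks) - 2:
        raise ValueError("too many blocks pruned")

    # Assumption: the pruned blocks are the ones just before the final block,
    # and the final block itself is always kept.
    kept = blocks[: len(blocks) - 1 - num_pruned] + blocks[-1:]
    if num_expanded > len(kept) - 1:
        raise ValueError("too many blocks expanded")

    new_blocks: List[Block] = []
    trainable: List[bool] = []
    for i, block in enumerate(kept):
        new_blocks.append(list(block))
        trainable.append(False)  # original blocks stay frozen
        if i < num_expanded:
            # Assumption: one new block after each of the first
            # `num_expanded` shallow blocks.
            new_blocks.append(interpolate(block, kept[i + 1]))
            trainable.append(True)  # only expanded blocks are fine-tuned
    return new_blocks, trainable


if __name__ == "__main__":
    toy = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
    new_blocks, trainable = prune_and_expand(toy, num_pruned=2, num_expanded=2)
    print(new_blocks)
    print(trainable)
```

Pruning as many blocks as are added keeps the overall depth unchanged in this toy setup; whether the actual models follow that balance is not stated here.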

Paper Structure (26 sections, 5 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Llama Slayer-8B-Instruct achieves state-of-the-art performance across several tasks, spanning from general language tasks to specific domain tasks (programming and mathematics), outperforming its LLaMA series predecessors.
  • Figure 2: (a) Angular distance between the inputs and outputs of each block vs. the layer number; (b) Accuracy of Llama2-7B and Mistral-7B after dropping one layer on two general QA benchmarks; (c) Accuracy of Llama2-7B and Mistral-7B with one more inserted block on the same two QA benchmarks. Initialization schemes for the expanded blocks include identity copy and averaging the weights of the adjacent blocks.
  • Figure 3: Our strategy focuses on infusing domain-specific knowledge into the first half of the model layers (shallow layers) and the final layer. This is achieved by augmenting the model via block expansion after deleting layers deemed non-essential. The expanded blocks are initialized using a linear interpolation technique to ensure a coherent knowledge structure, and only the expanded blocks are fine-tuned.
  • Figure 4: The resulting rank of each incremental matrix when fine-tuning Llama2-7B on our 5B token math+code dataset with AdaLoRA.
  • Figure 5: Comparative analysis of performance variations of different training strategies relative to the base model (a) Llama2-7B and (b) Mistral-7B, on both general and law-specific tasks.
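
The per-block angular distance mentioned in the Figure 2 caption can be sketched as follows. The exact definition used in the paper is not given here, so this assumes a common convention: the arccosine of the cosine similarity between a block's input and output vectors, divided by π so the result lies in [0, 1]. The function names and the list-of-floats representation of hidden states are hypothetical.

```python
import math
from typing import List, Sequence


def angular_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return arccos(cosine similarity) / pi, a value in [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos) / math.pi


def block_angular_distances(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each block's input and output.

    hidden_states[i] is the input to block i and hidden_states[i + 1] is its
    output, so a model with n blocks supplies n + 1 vectors.
    """
    return [
        angular_distance(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
    print(block_angular_distances(states))
```

Under this convention, a block whose output points in nearly the same direction as its input has a distance close to 0, meaning it changes the representation little.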