Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective

Ming Zhong; Chenxin An; Weizhu Chen; Jiawei Han; Pengcheng He

TL;DR

The paper addresses whether implicit, task-specific knowledge encoded in large language models can be transferred to smaller models across scales. It introduces a parametric knowledge transfer framework that first extracts task-relevant parameters from a larger teacher using sensitivity measurements and then injects them into a smaller student via LoRA, with layer mapping and dimensionality reduction to align architectures. Through experiments on four benchmarks, it provides empirical evidence of cross-scale parametric knowledge transfer and analyzes how factors like teacher size, initialization, seed-sample count, and the origin/structure of extracted parameters affect performance. The results offer a low-cost, scalable alternative to full distillation and yield practical guidelines for leveraging implicit knowledge to improve smaller LLMs, potentially democratizing access to strong capabilities. The findings underscore the importance of preserving parameter structure during extraction and indicate that FFN and higher layers often carry salient knowledge, with submatrix-level transfers delivering the strongest gains.
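The extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a first-order sensitivity score, |w · ∂L/∂w| averaged over the seed samples, and it picks the contiguous window with the largest total score so that the extracted block keeps its row and column structure. Neither the exact score nor the selection rule is spelled out in this summary, and the names `sensitivity` and `best_submatrix` are hypothetical.

```python
import numpy as np


def sensitivity(weights, grads):
    """Assumed first-order sensitivity: |w * dL/dw|, averaged over seed samples.

    weights: (rows, cols) teacher weight matrix.
    grads:   (num_seed_samples, rows, cols) per-sample loss gradients.
    """
    return np.abs(weights[None, :, :] * grads).mean(axis=0)


def best_submatrix(scores, out_rows, out_cols):
    """Return the top-left corner (row, col) of the contiguous
    out_rows x out_cols window with the largest total score.

    A contiguous window is an assumption made here so the extracted
    block preserves the original parameter structure.
    """
    rows, cols = scores.shape
    if not (1 <= out_rows <= rows and 1 <= out_cols <= cols):
        raise ValueError("window must fit inside the score matrix")
    # 2D prefix sums let every window total be computed in one vectorized step.
    prefix = np.zeros((rows + 1, cols + 1))
    prefix[1:, 1:] = scores.cumsum(axis=0).cumsum(axis=1)
    window = (
        prefix[out_rows:, out_cols:]
        - prefix[:-out_rows, out_cols:]
        - prefix[out_rows:, :-out_cols]
        + prefix[:-out_rows, :-out_cols]
    )
    r, c = np.unravel_index(np.argmax(window), window.shape)
    return int(r), int(c)


# Toy example: a 4x4 "teacher" matrix whose gradients are non-zero
# only in rows 1-2, columns 2-3.
teacher = np.ones((4, 4))
seed_grads = np.zeros((2, 4, 4))
seed_grads[0, 1:3, 2:4] = 1.0
seed_grads[1, 1:3, 2:4] = -1.0  # sign is ignored by the absolute value

scores = sensitivity(teacher, seed_grads)
row, col = best_submatrix(scores, 2, 2)
extracted = teacher[row:row + 2, col:col + 2]
```

In this toy case the selected window coincides with the only region that has non-zero gradients, which is the behavior one would want from a structure-preserving extraction.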

Abstract

Large Language Models (LLMs) inherently encode a wealth of knowledge within their parameters through pre-training on extensive corpora. While prior research has delved into operations on these parameters to manipulate the underlying implicit knowledge (encompassing detection, editing, and merging), there remains an ambiguous understanding regarding their transferability across models with varying scales. In this paper, we seek to empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. To achieve this, we employ sensitivity-based techniques to extract and align knowledge-specific parameters between different LLMs. Moreover, the LoRA module is used as the intermediary mechanism for injecting the extracted knowledge into smaller models. Evaluations across four benchmarks validate the efficacy of our proposed method. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer, underscoring the transferability of model parameters across LLMs of different scales. Project website: https://maszhongming.github.io/ParaKnowTransfer.
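As a rough sketch of the injection step, an extracted matrix can be factorized into the two low-rank LoRA factors. The example below assumes a truncated SVD as the factorization and splits the singular values evenly between the two factors; both choices, and the function name `lora_init_from_matrix`, are assumptions made for illustration rather than details confirmed by this summary.

```python
import numpy as np


def lora_init_from_matrix(extracted, rank):
    """Factorize an extracted (d_out, d_in) matrix into LoRA factors B and A
    such that B @ A is the best rank-`rank` approximation of `extracted`.

    Assumption: truncated SVD with singular values split evenly
    (square root on each side) between B and A.
    """
    if not 1 <= rank <= min(extracted.shape):
        raise ValueError("rank must be between 1 and min(d_out, d_in)")
    u, s, vt = np.linalg.svd(extracted, full_matrices=False)
    root = np.sqrt(s[:rank])
    b = u[:, :rank] * root          # shape (d_out, rank)
    a = root[:, None] * vt[:rank]   # shape (rank, d_in)
    return b, a


# Toy example: a matrix that is exactly rank 2 is recovered by rank-2 factors.
rng = np.random.default_rng(0)
extracted = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
lora_b, lora_a = lora_init_from_matrix(extracted, rank=2)

# During fine-tuning, the student's effective weight would be W + lora_b @ lora_a,
# with only lora_b and lora_a being updated.
```

With this initialization the LoRA update starts from the transferred parameters instead of the usual zero product, so the student begins fine-tuning with the extracted knowledge already in place.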

Paper Structure (33 sections, 10 equations, 6 figures, 6 tables)

Introduction
Related Work
Manipulation of Implicit Model Knowledge
Inheritance of Model Knowledge
Transfer of Model Knowledge
Parametric Knowledge Transfer
Task Formulation
Knowledge Extraction
Sensitivity of the Parameters
Layer Selection and Dimensionality Reduction
Knowledge Injection
LoRA Module
Knowledge Injection with LoRA
Experiments
Experimental Setup
...and 18 more sections

Figures (6)

Figure 1: Different paradigms of knowledge transfer from teacher models to student models. (a) Online Distillation: utilizing soft logits from the fine-tuned teacher model to guide the training of the student model; (b) Offline Distillation: generating a distilled dataset that encapsulates the knowledge of the teacher model to fine-tune the student model. (c) Parametric Knowledge Transfer: extracting knowledge-specific parameters from the vanilla teacher model and injecting them into the student model to enhance its training efficacy.
Figure 2: Overview of our parametric knowledge transfer framework. Starting with the teacher model, we compute sensitivity metrics using a set of seed samples, which aids in the extraction of task-specific knowledge. Subsequently, the extracted parameter matrices are factorized to initialize the student model's LoRA module, serving as a bridge for knowledge injection.
Figure 3: Comparison of different initialization strategies. The y-axis represents the average score over four datasets. "13B to 7B (Sensitivity)" refers to initializing the LoRA module in the 7B model with submatrices from the 13B model based on sensitivity score. The orange dotted line denotes the result of fine-tuning 7B-LoRA without knowledge transfer.
Figure 4: Analysis of how the quantity of seed samples affects student performance.
Figure 5: Analysis of various aspects of extracted parameters from teacher models. The y-axis begins with the result of direct fine-tuning students without knowledge injection.
...and 1 more figure
