Beyond Student: An Asymmetric Network for Neural Network Inheritance

Yiyun Zhou; Jingwei Shi; Mingjing Xu; Zhonghua Jiang; Jingyuan Chen

TL;DR

This work tackles the capacity gap in knowledge distillation by introducing InherNet, which directly inherits both the knowledge and the structure of a pretrained teacher. It achieves this through an asymmetric low-rank decomposition using SVD-based initialization and a Mixture-of-Experts–style structure that deepens and widens the network without excessive disruption. The authors provide rigorous convergence guarantees and parameter-efficiency proofs, showing that InherNet can preserve expressive power while reducing parameters, and demonstrate superior or competitive performance across unimodal and multimodal tasks with faster convergence than traditional KD. The results suggest a practical pathway for efficient model compression beyond standard distillation, with broad applicability to vision, language, and multimodal systems. Overall, InherNet offers a principled, scalable approach to inheriting teacher knowledge that can inform future model compression and transfer learning strategies.

Abstract

Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher's structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher's weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.

Paper Structure (64 sections, 20 theorems, 43 equations, 5 figures, 14 tables, 1 algorithm)

Introduction
Methodology
InherNet
Knowledge Inheritance
Structure Inheritance
Theoretical Analysis of Convergence Guarantees
Gradient Decomposition under InherNet
Effect of SVD-Based Initialization and Convergence Guarantee
Theoretical Proof of Parameter Efficiency
Compression Bounds for InherNet
Representational Power Preservation
Experiments
Vision: Image Classification
Results on CIFAR-100
Results on ImageNet
...and 49 more sections

Key Result

Theorem 2.1

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ with singular value decomposition $W = U \Sigma V^\top$, the optimal rank-$r$ approximation in terms of the Frobenius norm is given by $W_r = U_{[:, :r]} \Sigma_{[:r, :r]} V_{[:, :r]}^\top$. Moreover, the approximation error is minimized; specifically, $\|W - W_r\|_F^2 = \sum_{i=r+1}^{\min(m,n)} \sigma_i^2(W)$, where $\sigma_i(W)$ denotes the $i$-th singular value of $W$ (Golub et al., 1987; Eckart and Young, 1936).
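
The Eckart-Young-Mirsky statement in Theorem 2.1 can be checked numerically. The sketch below is an illustration only and is not taken from the paper: the helper name `truncated_svd`, the matrix shape, and the rank `r = 8` are all assumptions. It builds the rank-$r$ truncation with NumPy, compares the squared Frobenius error against the sum of the discarded squared singular values, and compares it against one other rank-$r$ candidate (a projection onto a random subspace).

```python
import numpy as np

def truncated_svd(W, r):
    """Hypothetical helper: rank-r truncation of W plus all singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale the first r left singular vectors by their singular values,
    # then multiply by the first r right singular vectors (rows of Vt).
    W_r = (U[:, :r] * S[:r]) @ Vt[:r, :]
    return W_r, S

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8              # illustrative sizes, not from the paper
W = rng.standard_normal((m, n))

W_r, S = truncated_svd(W, r)

# Squared Frobenius error of the truncation.
err_sq = np.linalg.norm(W - W_r, "fro") ** 2
# Sum of the squared singular values that were discarded.
tail_sq = np.sum(S[r:] ** 2)

# A competing rank-r approximation: project W onto a random r-dim subspace.
Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
W_proj = Q @ (Q.T @ W)
err_proj_sq = np.linalg.norm(W - W_proj, "fro") ** 2

print("truncation error^2 :", err_sq)
print("tail energy        :", tail_sq)
print("random-subspace err:", err_proj_sq)
```

The first two printed values should agree up to floating-point error, and the third should be no smaller than the first. A single random competitor does not prove optimality; it only illustrates the inequality the theorem guarantees.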

Figures (5)

Figure 1: The difference between LoRA and NNI.
Figure 2: Overview of the proposed InherNet (e.g., decomposition of a convolutional layer). InherNet consists of two parts: (a) Knowledge Inheritance and (b) Structure Inheritance. Note that the low-rank decomposition of a linear layer is similar; the key is to satisfy the requirements of the 2D SVD operation.
Figure 3: The impact of distillation on InherNet of different scales with varying ranks.
Figure 4: The impact of rank size $r$ and the number of expert heads $H$ on the performance of the proposed InherNet.
Figure 5: The training and testing loss curves of InherNet and various KD methods during training.

Theorems & Definitions (37)

Theorem 2.1: Eckart-Young-Mirsky theorem
Lemma 2.2: Gradient Decomposition
Proposition 2.3: Stability via Orthonormal Initialization
Theorem 2.4: Non-Convex Convergence
Definition 2.1: Parameter Efficiency (PE)
Definition 2.2: Approximation Error
Definition 2.3: Expressivity-to-Parameter Ratio (EPR)
Theorem 2.5: Parameter Reduction Bounds
Lemma 2.6: Spectral Energy Preservation
Proposition 2.7: Knowledge Preservation Rate
...and 27 more
