Beyond Student: An Asymmetric Network for Neural Network Inheritance

Yiyun Zhou, Jingwei Shi, Mingjing Xu, Zhonghua Jiang, Jingyuan Chen

TL;DR

This work tackles the capacity gap in knowledge distillation by introducing InherNet, which directly inherits both the knowledge and the structure of a pretrained teacher. It achieves this through an asymmetric low-rank decomposition using SVD-based initialization and a Mixture-of-Experts–style structure that deepens and widens the network without excessive disruption. The authors provide rigorous convergence guarantees and parameter-efficiency proofs, showing that InherNet can preserve expressive power while reducing parameters, and demonstrate superior or competitive performance across unimodal and multimodal tasks with faster convergence than traditional KD. The results suggest a practical pathway for efficient model compression beyond standard distillation, with broad applicability to vision, language, and multimodal systems. Overall, InherNet offers a principled, scalable approach to inheriting teacher knowledge that can inform future model compression and transfer learning strategies.

Abstract

Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher's structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher's weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.
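To make the idea of SVD-initialized low-rank decomposition concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `svd_factorize`, the random stand-in for a teacher weight, and the choice to absorb the singular values into the left factor are all assumptions made for illustration. The paper's actual asymmetric scheme and its Mixture-of-Experts-style structure are not reproduced here.

```python
import numpy as np

def svd_factorize(W, r):
    """Split W (m x n) into A (m x r) and B (r x n) so that A @ B is the rank-r SVD truncation of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]  # assumption: singular values are absorbed into the left factor
    B = Vt[:r, :]         # rows of B are orthonormal
    return A, B

rng = np.random.default_rng(0)
m, n, r = 256, 128, 16           # hypothetical layer size and rank
W = rng.standard_normal((m, n))  # random stand-in for a pretrained teacher weight
A, B = svd_factorize(W, r)

x = rng.standard_normal(n)
y_teacher = W @ x                # original dense layer
y_factored = A @ (B @ x)         # two thin layers applied in sequence

params_full = m * n              # parameters in the dense layer
params_low = r * (m + n)         # parameters in the two factors
print(params_full, params_low)
```

With these illustrative sizes the factorized pair stores 6,144 weights instead of 32,768. How closely `y_factored` tracks `y_teacher` depends on how quickly the singular values of the real teacher weight decay, which a random matrix does not model.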

Paper Structure (64 sections, 20 theorems, 43 equations, 5 figures, 14 tables, 1 algorithm)

Key Result

Theorem 2.1

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ with singular value decomposition $W = U \Sigma V^\top$, the optimal rank-$r$ approximation in terms of the Frobenius norm is given by $W_r = U_{[:, :r]} \Sigma_{[:r, :r]} V_{[:, :r]}^\top$. Moreover, the approximation error is minimized; specifically, $\|W - W_r\|_F^2 = \sum_{i=r+1}^{\min(m,n)} \sigma_i(W)^2$, where $\sigma_i(W)$ denotes the $i$-th singular value of $W$ (Eckart and Young, 1936; Golub et al., 1987).
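The statement of Theorem 2.1 can be checked numerically. The sketch below builds the rank-$r$ truncation of a random matrix (a hypothetical stand-in for a teacher weight; the helper name `truncated_svd` is also an assumption) and compares the squared Frobenius error with the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(W, r):
    """Return W_r = U[:, :r] @ diag(S[:r]) @ Vt[:r, :] together with all singular values of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_r = (U[:, :r] * S[:r]) @ Vt[:r, :]  # scale the first r columns of U, then project
    return W_r, S

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # random stand-in for a weight matrix
r = 8
W_r, S = truncated_svd(W, r)

err = np.linalg.norm(W - W_r, "fro") ** 2  # squared Frobenius error of the truncation
tail = np.sum(S[r:] ** 2)                  # sum of squared singular values beyond rank r
print(err, tail)
```

The two printed values agree up to floating-point error, which is the equality part of the theorem. The optimality part (no other rank-$r$ matrix does better) is not something a single run can verify.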

Figures (5)

  • Figure 1: The difference between LoRA and NNI.
  • Figure 2: Overview of the proposed InherNet, illustrated with the decomposition of a convolutional layer. InherNet consists of two parts: (a) Knowledge Inheritance and (b) Structure Inheritance. The low-rank decomposition of a linear layer is similar; the key is to satisfy the properties of the 2D SVD operation.
  • Figure 3: The impact of distillation on InherNet of different scales with varying ranks.
  • Figure 4: The impact of rank size $r$ and the number of expert heads $H$ on the performance of the proposed InherNet.
  • Figure 5: The training and testing loss curves of InherNet and various KD methods during training.

Theorems & Definitions (37)

  • Theorem 2.1: Eckart-Young-Mirsky theorem
  • Lemma 2.2: Gradient Decomposition
  • Proposition 2.3: Stability via Orthonormal Initialization
  • Theorem 2.4: Non-Convex Convergence
  • Definition 2.1: Parameter Efficiency (PE)
  • Definition 2.2: Approximation Error
  • Definition 2.3: Expressivity-to-Parameter Ratio (EPR)
  • Theorem 2.5: Parameter Reduction Bounds
  • Lemma 2.6: Spectral Energy Preservation
  • Proposition 2.7: Knowledge Preservation Rate
  • ...and 27 more