Beyond Student: An Asymmetric Network for Neural Network Inheritance
Yiyun Zhou, Jingwei Shi, Mingjing Xu, Zhonghua Jiang, Jingyuan Chen
TL;DR
This work tackles the capacity gap in knowledge distillation by introducing InherNet, which directly inherits both the knowledge and the structure of a pretrained teacher. It does so through an asymmetric low-rank decomposition of the teacher's weights, initialized with SVD, and a Mixture-of-Experts–style structure that deepens and widens the network without excessive disruption. The authors provide convergence guarantees and parameter-efficiency proofs showing that InherNet can preserve expressive power while reducing parameters, and they report superior or competitive performance on unimodal and multimodal tasks, with faster convergence than traditional KD. The results suggest a practical, principled route to model compression beyond standard distillation, applicable to vision, language, and multimodal systems, that may inform future work on model compression and transfer learning.
Abstract
Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap between teacher and student often limits the student's performance. Inspired by the expressiveness of pretrained teacher networks, we ask: is there a type of network that can not only inherit the teacher's structure but also maximize the inheritance of its knowledge? And how does such an inheriting network compare with student networks when both benefit from the same teacher? To explore these questions, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher's weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By using Singular Value Decomposition (SVD) for initialization, which ensures that the principal knowledge is inherited, InherNet balances depth, width, and compression efficiency. Experiments on unimodal and multimodal tasks show that InherNet outperforms student networks of similar parameter size. Our findings point to a promising direction for future research in efficient model compression beyond traditional distillation.
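To make the SVD-based initialization concrete, the following is a minimal sketch of factoring one teacher weight matrix into two low-rank factors with a truncated SVD. It is an illustration under stated assumptions, not the authors' implementation: the function names `svd_low_rank_init` and `param_count`, the matrix shape, the chosen rank, and the decision to absorb the singular values into the left factor are all hypothetical, and the abstract does not specify how InherNet's asymmetric decomposition or its Mixture-of-Experts–style structure is built.

```python
import numpy as np


def svd_low_rank_init(weight, rank):
    """Factor `weight` (out_dim x in_dim) into a @ b using a truncated SVD.

    Assumption: the singular values are absorbed into the left factor `a`.
    The abstract does not say how InherNet distributes them.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_dim, rank), columns scaled by singular values
    b = vt[:rank, :]            # shape (rank, in_dim)
    return a, b


def param_count(out_dim, in_dim, rank):
    """Return (dense parameter count, factored parameter count)."""
    return out_dim * in_dim, rank * (out_dim + in_dim)


# Hypothetical stand-in for a pretrained teacher layer.
rng = np.random.default_rng(0)
teacher_w = rng.standard_normal((64, 32))

a, b = svd_low_rank_init(teacher_w, rank=8)

# Relative Frobenius error of the rank-8 approximation.
rel_err = np.linalg.norm(teacher_w - a @ b) / np.linalg.norm(teacher_w)
dense_params, factored_params = param_count(64, 32, 8)
```

Keeping the top singular directions gives the best rank-limited approximation of the matrix in the Frobenius norm (the Eckart-Young theorem), which is one way to read "inheritance of principal knowledge". A random matrix like the one above has a slowly decaying spectrum, so its approximation error is comparatively large; trained weights are often closer to low rank, but that is not guaranteed.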
