NanoNet: Parameter-Efficient Learning with Label-Scarce Supervision for Lightweight Text Mining Model
Qianren Mao, Yashuo Luo, Ziqi Qin, Junnan Liu, Weifeng Jiang, Zhijun Chen, Zhuoran Li, Likang Xiao, Chuou Xu, Qili Zhang, Hanwen Hao, Jingzheng Li, Chunghua Lin, Jianxin Li, Philip S. Yu
TL;DR
NanoNet tackles the challenge of building lightweight text classifiers under extreme label scarcity by combining online knowledge distillation, mutual learning among small student cohorts, and parameter-efficient fine-tuning. The framework leverages sequential unpadding, alternating attention, and Flash Attention to maximize efficiency, while updating only bias terms to minimize training cost. Empirical results on multiple SSL benchmarks show NanoNet achieving competitive accuracy with orders of magnitude fewer trainable parameters than strong baselines, and with modest inference overhead. This approach offers a practical path for deploying encoder-only PLMs in resource-constrained settings without sacrificing performance.
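To make the effect of updating only bias terms concrete, the following minimal sketch counts weights and biases in a stack of linear layers and reports the fraction that would be trained. The layer shapes and the helper names `count_params` and `bias_only_fraction` are hypothetical illustrations, not NanoNet's actual architecture or reported numbers.

```python
def count_params(layer_shapes):
    """Return (weight_count, bias_count) for a list of (in_dim, out_dim) linear layers."""
    weights = sum(i * o for i, o in layer_shapes)
    biases = sum(o for _, o in layer_shapes)
    return weights, biases


def bias_only_fraction(layer_shapes):
    """Fraction of all parameters that is trained when only bias terms are updated."""
    weights, biases = count_params(layer_shapes)
    return biases / (weights + biases)


# Hypothetical encoder (assumption, for illustration only): 6 blocks, each with
# four 768x768 projections and a 768 -> 3072 -> 768 feed-forward pair.
block = [(768, 768)] * 4 + [(768, 3072), (3072, 768)]
shapes = block * 6

weights, biases = count_params(shapes)
fraction = bias_only_fraction(shapes)
print(f"weights={weights}, biases={biases}, trainable fraction={fraction:.4%}")
```

Under these assumed shapes, the bias terms make up less than 0.1% of all parameters, which is the kind of gap the phrase "orders of magnitude fewer trainable parameters" refers to.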
Abstract
The lightweight semi-supervised learning (LSL) strategy provides an effective approach to conserving labeled samples and minimizing model inference costs. Prior research has effectively applied knowledge transfer learning and co-training regularization from large to small models in LSL. However, such training strategies are computationally intensive and prone to local optima, which makes the optimal solution harder to find. This has prompted us to investigate the feasibility of integrating three low-cost scenarios for text mining tasks: limited labeled supervision, lightweight fine-tuning, and rapid-inference small models. We propose NanoNet, a novel framework for lightweight text mining that implements parameter-efficient learning with limited supervision. It employs online knowledge distillation to generate multiple small models and enhances their performance through mutual learning regularization. The entire process leverages parameter-efficient learning, reducing training costs and minimizing supervision requirements, ultimately yielding a lightweight model for downstream inference.
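The mutual learning regularization mentioned above can be sketched as follows. This is a generic objective in the style of deep mutual learning, offered as an assumption: the abstract does not give NanoNet's exact loss, and the function names and the `weight` parameter are hypothetical. Each student receives an optional supervised cross-entropy term (for labeled examples) plus the mean KL divergence from its peers' predictions to its own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mutual_learning_losses(student_logits, label=None, weight=1.0):
    """Per-student loss for one example (requires at least two students).

    label=None marks an unlabeled example, for which only the peer
    agreement term is used.
    """
    probs = [softmax(logits) for logits in student_logits]
    losses = []
    for k, p_k in enumerate(probs):
        peers = [p for j, p in enumerate(probs) if j != k]
        # Pull student k toward each peer's predicted distribution.
        mimic = sum(kl_divergence(p_j, p_k) for p_j in peers) / len(peers)
        supervised = cross_entropy(p_k, label) if label is not None else 0.0
        losses.append(supervised + weight * mimic)
    return losses


# Two hypothetical students scoring one labeled example with three classes.
print(mutual_learning_losses([[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]], label=0))
```

When the students already agree, the peer term vanishes and each loss reduces to the plain supervised term, so the regularizer only acts where the small models disagree.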
