NanoNet: Parameter-Efficient Learning with Label-Scarce Supervision for Lightweight Text Mining Model

Qianren Mao, Yashuo Luo, Ziqi Qin, Junnan Liu, Weifeng Jiang, Zhijun Chen, Zhuoran Li, Likang Xiao, Chuou Xu, Qili Zhang, Hanwen Hao, Jingzheng Li, Chunghua Lin, Jianxin Li, Philip S. Yu

TL;DR

NanoNet builds lightweight text classifiers under extreme label scarcity by combining online knowledge distillation, mutual learning among a cohort of small student models, and parameter-efficient fine-tuning. For efficiency, the framework uses sequential unpadding, alternating attention, and Flash Attention, and it updates only bias terms to keep training cost low. Empirical results on multiple SSL benchmarks show NanoNet achieving competitive accuracy with orders of magnitude fewer trainable parameters than strong baselines, and with modest inference overhead. The approach offers a practical path for deploying encoder-only PLMs in resource-constrained settings.
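To give a rough sense of what "updating only bias terms" means for the trainable-parameter budget, the following minimal sketch counts bias versus weight elements in one encoder layer. The layer inventory, the parameter names, and the BERT-base-style dimensions (hidden size 768, feed-forward size 3072) are assumptions made for illustration; they are not taken from the paper, and layer-norm parameters are left out for brevity.

```python
# Hypothetical parameter inventory for a single encoder layer (name -> element count).
# Dimensions are BERT-base-style assumptions, not values reported for NanoNet.
HIDDEN = 768
FFN = 3072

layer_params = {
    "attention.query.weight": HIDDEN * HIDDEN,
    "attention.query.bias": HIDDEN,
    "attention.key.weight": HIDDEN * HIDDEN,
    "attention.key.bias": HIDDEN,
    "attention.value.weight": HIDDEN * HIDDEN,
    "attention.value.bias": HIDDEN,
    "attention.output.weight": HIDDEN * HIDDEN,
    "attention.output.bias": HIDDEN,
    "ffn.in.weight": HIDDEN * FFN,
    "ffn.in.bias": FFN,
    "ffn.out.weight": FFN * HIDDEN,
    "ffn.out.bias": HIDDEN,
}


def select_bias_only(param_sizes):
    """Return the subset of parameters that stays trainable: names ending in '.bias'."""
    return {name: n for name, n in param_sizes.items() if name.endswith(".bias")}


trainable = select_bias_only(layer_params)
trainable_count = sum(trainable.values())
total_count = sum(layer_params.values())
fraction = trainable_count / total_count

print(f"trainable: {trainable_count} of {total_count} ({fraction:.4%})")
```

Under these assumed dimensions, the bias vectors account for well under 0.1% of the layer's parameters, which is the kind of gap the "orders of magnitude fewer trainable parameters" claim refers to.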

Abstract

The lightweight semi-supervised learning (LSL) strategy provides an effective approach to conserving labeled samples and minimizing model inference costs. Prior research has effectively applied knowledge transfer learning and co-training regularization from large to small models in LSL. However, such training strategies are computationally intensive and prone to local optima, which makes the optimal solution harder to find. This has prompted us to investigate the feasibility of integrating three low-cost scenarios for text mining tasks: limited labeled supervision, lightweight fine-tuning, and rapid-inference small models. We propose NanoNet, a novel framework for lightweight text mining that implements parameter-efficient learning with limited supervision. It employs online knowledge distillation to generate multiple small models and enhances their performance through mutual learning regularization. The entire process leverages parameter-efficient learning, reducing training costs and minimizing supervision requirements, ultimately yielding a lightweight model for downstream inference.
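The mutual learning regularization mentioned above can be illustrated with a minimal sketch in the style of Deep Mutual Learning, where each student is pulled toward its peer's predicted distribution through a KL term, plus a cross-entropy term when a label is available. The function names, the `weight` parameter, and the exact form of the objective are assumptions for illustration; the abstract does not specify NanoNet's actual loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label])


def mutual_learning_losses(logits_a, logits_b, label=None, weight=1.0):
    """Per-student losses for a two-student cohort (hypothetical helper).

    Each student mimics its peer via a KL term; labeled samples also add
    the usual cross-entropy. Unlabeled samples pass label=None.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = weight * kl_divergence(p_b, p_a)  # student A moves toward B
    loss_b = weight * kl_divergence(p_a, p_b)  # student B moves toward A
    if label is not None:
        loss_a += cross_entropy(p_a, label)
        loss_b += cross_entropy(p_b, label)
    return loss_a, loss_b


# Unlabeled sample: only the peer-agreement term contributes.
unlabeled = mutual_learning_losses([2.0, 0.5, -1.0], [1.0, 1.5, -0.5])
# Labeled sample: cross-entropy is added on top of the peer term.
labeled = mutual_learning_losses([2.0, 0.5, -1.0], [1.0, 1.5, -0.5], label=0)
```

In a real training loop the peer's distribution would typically be treated as a constant (no gradient) when computing each student's loss; that detail is omitted here because the sketch only evaluates loss values.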

Paper Structure (36 sections, 22 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Visualization of the Centered Kernel Alignment (CKA; ZhuW21a) scores of PsNet in Subfigures (c) and (d), along with its ablation variant, SingleStudent, in Subfigures (a) and (b). All models use a 6-layer BERT. The evaluation is conducted on text classification with the AG News dataset, using 10 labeled instances per class.
  • Figure 2: Lightweight training modules and settings for (a) NanoNet, (b) PsNet, and (c) DisCo during the training and inference stages. A minus sign marks a lightweight setting and a plus sign marks an additional module. SSL stands for Semi-Supervised Learning. TOT denotes 'Teacher and Student training with online knowledge distillation,' whereas TFT denotes the offline variant. DML stands for Deep Mutual Learning.
  • Figure 3: (LEFT) Overview of the NanoNet framework. The Teacher model heuristically delegates distinct layers to distinct student models. Throughout training, only bias parameters are updated while all remaining weights stay frozen. (RIGHT) Schematic of the Encoder Block. Following ModernBERT, unpadding is fused into a single Flash-Attention kernel by supplying the sequence-boundary indices obtained from tokenization directly as inputs to the attention computation. Global attention is applied at fixed sparse intervals with a high-capacity RoPE (Rotary Position Embedding), while the remaining layers employ local sliding-window attention with a correspondingly reduced RoPE.
  • Figure 4: Freezing or unfreezing the embeddings has a measurable effect on performance. Adjacent bars represent the two student networks inside NanoNet, and the figure overlays the mean accuracy across both students under the frozen-embedding and unfrozen-embedding conditions.
  • Figure 5: Accuracy surface of student peers across labeled-sample counts. X-axis: labeled-sample count. Y-axis: model ID. Z-axis: accuracy. Scatter points show the student peers' performance under equal labeling budgets, with the surface obtained by linear interpolation.
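
For readers unfamiliar with the CKA scores shown in Figure 1, the sketch below computes linear CKA between two sets of layer activations, each given as a row-major matrix with one row per example. This is the standard linear formulation, written in plain Python for illustration; whether the paper uses the linear or a kernel variant is not stated in the caption, and the function names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean so every column has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of x^T y for matrices with the same number of rows."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


# Toy activations: 4 examples, 2 features each (made-up numbers).
acts = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]]
scaled = [[2.0 * v for v in row] for row in acts]

same_score = linear_cka(acts, acts)      # identical representations
scaled_score = linear_cka(acts, scaled)  # CKA is invariant to isotropic scaling
```

A score near 1 indicates that two layers encode similar structure across examples, while a score near 0 indicates unrelated representations, which is how heatmaps like those in Figure 1 are usually read.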