Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER
Junyi Zhu, Savas Ozkan, Andrea Maracani, Sinan Mutlu, Cho Jung Min, Mete Ozay
TL;DR
This work tackles the challenge of adapting lightweight transformer encoders to multiple NLP tasks on mobile devices. It first analyzes two strong pre-finetuning strategies (NER-focused distillation and text-classification contrastive learning) and shows that naïve multi-task pre-finetuning degrades performance due to conflicting representational needs. To reconcile them, the authors propose Multi-Task Pre-Finetuning with Task-Primary LoRAs (MTPF-TPL), a framework that attaches task-specific LoRA adapters to the last few transformer layers while allowing the backbone to be updated by both task objectives; after pre-finetuning, the backbone remains centralized while adapters are distributed to applications. Empirically, MTPF-TPL achieves performance close to individually pre-finetuned models across 21 tasks, with average gains of +0.8% for NER and +8.8% for text classification, and shows strong gains in low-resource NER scenarios (e.g., +8.4% with only 10% of data). This approach satisfies deployment constraints by maintaining a single backbone and modular adapters, enabling versatile on-device NLP without sacrificing task-specific performance.
Abstract
Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraints. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.
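To make the shared-backbone and modular-adapter arrangement concrete, the following is a minimal structural sketch and not the authors' implementation. It assumes the standard LoRA update (a low-rank product B·A scaled by alpha/r, with B initialized to zero), reduces each transformer layer to a single linear map, and omits training and the two pre-finetuning objectives entirely. The names `LoRA`, `Encoder`, and `lora_layers`, as well as the rank, scaling, and task labels, are hypothetical choices for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRA:
    """Low-rank update: delta(x) = (alpha / r) * B @ (A @ x)."""

    def __init__(self, dim, rank, alpha, rng):
        self.A = rand_matrix(rank, dim, rng)          # r x d, random init
        self.B = [[0.0] * rank for _ in range(dim)]   # d x r, zero init
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class Encoder:
    """Shared backbone with per-task adapters on the last `lora_layers` layers.

    Assumption: each "layer" is one linear map, a stand-in for a full
    transformer block.
    """

    def __init__(self, num_layers, dim, lora_layers, tasks, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # One set of backbone weights, shared by every task.
        self.weights = [rand_matrix(dim, dim, rng) for _ in range(num_layers)]
        first = num_layers - lora_layers
        # One adapter per task per adapted layer; these are the modular parts.
        self.adapters = {
            task: {i: LoRA(dim, rank, alpha, rng) for i in range(first, num_layers)}
            for task in tasks
        }

    def forward(self, x, task=None):
        h = list(x)
        for i, weight in enumerate(self.weights):
            out = matvec(weight, h)
            adapter = self.adapters.get(task, {}).get(i)
            if adapter is not None:
                # Add the task-specific low-rank correction to the shared output.
                out = [o + d for o, d in zip(out, adapter.delta(h))]
            h = out
        return h


enc = Encoder(num_layers=4, dim=3, lora_layers=2, tasks=["ner", "cls"])
x = [1.0, -0.5, 0.25]
base = enc.forward(x)            # backbone only
ner_out = enc.forward(x, "ner")  # backbone plus the NER adapters
```

Because B starts at zero, each adapter initially leaves the backbone output unchanged; changing one task's adapter then affects only that task's forward pass, which is what allows the adapters to be shipped separately from a single backbone.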
