Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER

Junyi Zhu, Savas Ozkan, Andrea Maracani, Sinan Mutlu, Cho Jung Min, Mete Ozay

TL;DR

This work tackles the challenge of adapting lightweight transformer encoders to multiple NLP tasks on mobile devices. It first analyzes two strong pre-finetuning strategies—NER-focused distillation and text-classification contrastive learning—and shows that naïve multi-task pre-finetuning degrades performance due to conflicting representational needs. To reconcile them, the authors propose Multi-Task Pre-Finetuning with Task-Primary LoRAs (MTPF-TPL), a framework that attaches task-specific LoRA adapters to the last few transformer layers while allowing the backbone to be updated by both task objectives; after pre-finetuning, the backbone remains centralized while adapters are distributed to applications. Empirically, MTPF-TPL achieves performance close to individually pre-finetuned models across 21 tasks, with average gains of $+0.8\%$ for NER and $+8.8\%$ for text classification, and shows strong gains in low-resource NER scenarios (e.g., $+8.4\%$ with only $10\%$ of data). This approach preserves deployment constraints by maintaining a single backbone and modular adapters, enabling versatile on-device NLP without sacrificing task-specific performance.

Abstract

Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraints. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.
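
The shared-backbone-plus-adapters layout described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`LoRA`, `Layer`, `SharedEncoder`), the task keys `"ner"` and `"tc"`, and all sizes are hypothetical, and a single linear map stands in for a full transformer layer. It only shows how task-specific low-rank updates could be attached to the last few layers while every task reuses the same backbone weights; pre-finetuning itself is not shown.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRA:
    """Low-rank update delta_W = B @ A; B starts at zero (the usual LoRA init)."""

    def __init__(self, dim, rank, rng):
        self.A = rand_matrix(rank, dim, rng)
        self.B = [[0.0] * rank for _ in range(dim)]

    def delta(self, x):
        return matvec(self.B, matvec(self.A, x))


class Layer:
    """Stand-in for one transformer layer: a shared linear map (assumption)."""

    def __init__(self, dim, rng):
        self.W = rand_matrix(dim, dim, rng)
        self.adapters = {}  # task name -> LoRA; filled only for the last layers

    def forward(self, x, task=None):
        y = matvec(self.W, x)
        adapter = self.adapters.get(task)
        if adapter is not None:
            # Add the task-specific low-rank correction to the shared output.
            y = [a + b for a, b in zip(y, adapter.delta(x))]
        return y


class SharedEncoder:
    """One backbone for all tasks, with per-task LoRAs on the final layers."""

    def __init__(self, dim=4, num_layers=6, lora_layers=2, rank=2,
                 tasks=("ner", "tc"), seed=0):
        rng = random.Random(seed)
        self.layers = [Layer(dim, rng) for _ in range(num_layers)]
        for layer in self.layers[num_layers - lora_layers:]:
            for task in tasks:
                layer.adapters[task] = LoRA(dim, rank, rng)

    def forward(self, x, task=None):
        for layer in self.layers:
            x = layer.forward(x, task)
        return x


enc = SharedEncoder()
x = [1.0, 0.5, -0.5, 2.0]
backbone_out = enc.forward(x)       # no adapter selected
ner_out = enc.forward(x, "ner")     # backbone + NER adapters
tc_out = enc.forward(x, "tc")       # backbone + text-classification adapters
```

In this sketch an application would pass its task key with each request, which loosely mirrors the deployment setting where the backbone stays centralized and only the small adapters are distributed to applications.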

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: An illustration of the practical deployment setting on a mobile device. Mobile applications (APP-$\{1\ldots4\}$) use the language model by calling a system API and sending their job together with task-specific model adapters. Adapters are often in the form of LoRA modules or linear classifiers.
  • Figure 2: Illustration of different model configurations for processing inputs: (a) NER model predicts the entity type of each token; (b) text classification model extracts a text embedding representing the entire input; (c) our proposed shared encoder with task-primary (TP) LoRAs supports diverse outputs, with each LoRA module dedicated to its specific task.
  • Figure 3: Similarity trends during pre-finetuning: (a) perturbed sentences, as illustrated in Figure 5, become closer in embedding space when the model is optimized for NER; (b) token embeddings within the same sentence become more homogeneous when the model is optimized for text classification (TC).
  • Figure 4: Comparison of applying task-primary LoRAs (TPL) to varying numbers of final layers. (a) and (b) show pre-finetuning loss over training steps for different numbers of layers augmented with TPL. (c) and (d) present downstream performance across both task families under varying numbers of TPL-applied layers.
  • Figure 5: Example of entity replacement. Location and person entities (highlighted) have been substituted with random entities of the same type.