DALLMi: Domain Adaption for LLM-based Multi-label Classifier

Miruna Beţianu, Abele Mălan, Marco Aldinucci, Robert Birke, Lydia Chen

TL;DR

The paper tackles domain shift in multi-label text classification when the target domain has few labels. It introduces DALLMi, a semi-supervised domain adaptation framework for BERT-based multi-label classifiers that combines a per-label variational loss, embedding-level MixUp regularization, and a label-balanced sampling strategy to exploit scarce positive labels and abundant unlabeled target-domain data. Building on variational Positive-Unlabeled learning (VPU), DALLMi generates synthetic samples by linear interpolation (LERP) of embeddings and constrains learning through a norm-based variational objective. Across PubMed, ArXiv, and Movies, DALLMi achieves higher mAP than unsupervised and partially supervised baselines by 19.9% and 52.2%, respectively, and ablations support the contribution of the norm-based loss and embedding MixUp. The approach offers a data-efficient route to domain adaptation of LLM-based multi-label classifiers, and the authors provide public code to support reproducibility.
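As a rough illustration of the per-label variational loss mentioned above, the sketch below implements the base VPU objective, log E_U[Φ(x)] − E_P[log Φ(x)], independently for each label. It is not DALLMi's norm-based variant, which this summary does not spell out. The function names, the `None` marker for unlabeled entries, and the choice to sum over labels are all assumptions made for the example.

```python
import math

def vpu_variational_loss(unlabeled_probs, positive_probs):
    """Base VPU loss for ONE label: log E_U[phi(x)] - E_P[log phi(x)].

    unlabeled_probs: classifier probabilities phi(x) on unlabeled samples.
    positive_probs:  classifier probabilities phi(x) on known-positive samples.
    """
    mean_unlabeled = sum(unlabeled_probs) / len(unlabeled_probs)
    mean_log_positive = sum(math.log(p) for p in positive_probs) / len(positive_probs)
    return math.log(mean_unlabeled) - mean_log_positive

def multilabel_vpu_loss(probs, labels):
    """Apply the loss per label and sum (summing is an assumption).

    probs[i][k]:  predicted probability that sample i carries label k.
    labels[i][k]: 1 if sample i is a known positive for label k,
                  None if that entry is unlabeled (hypothetical encoding).
    """
    n_labels = len(probs[0])
    total = 0.0
    for k in range(n_labels):
        pos = [p[k] for p, y in zip(probs, labels) if y[k] == 1]
        unl = [p[k] for p, y in zip(probs, labels) if y[k] is None]
        if pos and unl:  # skip labels lacking either group in this batch
            total += vpu_variational_loss(unl, pos)
    return total
```

The loss decreases when known positives receive high probability while the average probability over unlabeled text stays low, which is how positive-only supervision can still shape the classifier.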

Abstract

Large language models (LLMs) increasingly serve as the backbone for classifying text that belongs to distinct domains and carries several labels (classes) simultaneously. When encountering domain shifts, e.g., moving a movie-review classifier from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging due to incomplete label sets in the target domain and daunting training overhead. Existing domain adaptation methods address either image multi-label classifiers or text binary classifiers. In this paper, we design DALLMi, Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for text data models based on LLMs, specifically BERT. The core of DALLMi is the novel variational loss and MixUp regularization, which jointly leverage the limited positively labeled text, the large quantity of unlabeled text, and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against partially supervised and unsupervised approaches on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves higher mAP than unsupervised and partially supervised approaches by 19.9% and 52.2%, respectively.
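Since the results are reported in mAP, a small worked sketch of the metric may help. The code below computes average precision per label from ranked scores and then averages over labels. Macro-averaging over labels and the tie handling (stable sort order) are assumptions, and the helper names are illustrative rather than taken from the authors' code.

```python
def average_precision(y_true, scores):
    """AP for one label: mean of precision@rank taken at each true positive."""
    # Rank sample indices by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(Y_true, Y_scores):
    """Macro mAP: average the per-label AP over all label columns."""
    n_labels = len(Y_true[0])
    aps = [
        average_precision([row[k] for row in Y_true],
                          [row[k] for row in Y_scores])
        for k in range(n_labels)
    ]
    return sum(aps) / len(aps)
```

For a single label with ground truth `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.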

Paper Structure (10 sections, 2 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Adapting BERT from IMDb to Rotten Tomatoes.
  • Figure 2: BERT multi-label classification flow.
  • Figure 3: DALLMi flow: Unlabeled ($?$) and positive ($1$) samples from the target domain are fed through BERT, generating label-specific output logits. The logits are used to compute partial per-label variational losses for unlabeled and positive samples (dashed box (i)). The MixUp regularization combines per-label linear interpolations (LERPs) applied to both inputs and outputs (dashed box (ii)).
  • Figure 4: Examples of possible MixUp strategies by linear interpolation in LLM hidden representations: (i) word embedding, (ii) encoding, (iii) sentence embedding.
  • Figure 5: mAP scores per epoch for supervised fine-tuning w/ 50% labels (blue), DALLMi w/ 50% labels (green), and supervised fine-tuning w/ 100% labels (red).
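The captions for Figures 3 and 4 describe MixUp as linear interpolation applied to both inputs and outputs. A minimal sketch of that idea for one label follows, using the VPU-style consistency term between the interpolated target and the prediction on the interpolated embedding. The toy model, the vector representation of an embedding, and the squared-log-difference form are assumptions for illustration. DALLMi's exact regularizer and its choice of interpolation layer are not reproduced here.

```python
import math

def lerp(a, b, lam):
    """Linear interpolation of two equal-length vectors: lam*a + (1-lam)*b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def toy_model(embedding):
    """Hypothetical stand-in for a classifier head: sigmoid of the vector sum."""
    return 1.0 / (1.0 + math.exp(-sum(embedding)))

def mixup_regularizer(model, emb_pos, emb_unl, lam):
    """Consistency term for one label, following the VPU MixUp formulation.

    Input side:  interpolate the positive and unlabeled embeddings.
    Output side: interpolate the targets, using 1 for the known positive
                 and the model's own prediction for the unlabeled sample.
    """
    mixed_emb = lerp(emb_pos, emb_unl, lam)
    mixed_target = lam * 1.0 + (1.0 - lam) * model(emb_unl)
    return (math.log(mixed_target) - math.log(model(mixed_emb))) ** 2
```

With `lam = 0` the mixed sample is just the unlabeled one, so the term vanishes. Larger `lam` pulls both the input and the target toward the known positive. In practice `lam` is usually drawn at random per pair, for example from a Beta distribution, whose parameters would be a tuning choice.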