Similarity-Based Domain Adaptation with LLMs
Jie He, Wendi Zhou, Xiang Lorraine Li, Jeff Z. Pan
TL;DR
This work tackles unsupervised domain adaptation without training a model on source-domain data: an LLM annotates the target-domain data directly, aided by kNN-based augmentation. The resulting knowledge is then distilled into smaller language models via two losses: a label-distribution alignment loss and a similarity-alignment loss that transfers representation structure. On cross-domain sentiment tasks, the approach improves accuracy by 2.44% over the previous SOTA across eight task setups, and the results highlight the importance of both target-label quality and representation-level supervision. This offers a practical path to deploying cross-domain NLP systems in resource-constrained settings, using LLMs for data annotation while keeping inference on an efficient small model.
Abstract
Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize to unlabeled target data. Prior research has primarily focused on learning domain-invariant features across the source and target domains. However, these methods often require training a model on source-domain data, which is time-consuming and can limit model usage for applications with different source data. This paper introduces a simple framework that uses the generalization capabilities of Large Language Models (LLMs) to annotate target data without the need for source model training, followed by a novel similarity-based knowledge distillation loss. Our extensive experiments on cross-domain text classification show that our framework achieves strong performance, specifically a 2.44% accuracy improvement over the SOTA method.
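
To make the two distillation losses named in the TL;DR concrete, the following is a minimal NumPy sketch of one plausible formulation, not the paper's exact definition. It assumes the label-distribution alignment loss is a KL divergence between the LLM's label distribution and the small model's predictions, and that the similarity-alignment loss is a mean squared error between the pairwise cosine-similarity matrices of the two models' representations. All function names and the weighting parameter `alpha` are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cosine_similarity_matrix(feats):
    # Pairwise cosine similarity between all examples in a batch.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def label_alignment_loss(teacher_probs, student_logits):
    # Assumed form: KL(teacher || student), averaged over the batch.
    eps = 1e-12
    student_probs = softmax(student_logits)
    kl = teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps))
    return float(kl.sum(axis=1).mean())


def similarity_alignment_loss(teacher_feats, student_feats):
    # Assumed form: MSE between the two batch-by-batch similarity matrices.
    # Feature dimensions may differ, since only the similarities are compared.
    diff = cosine_similarity_matrix(teacher_feats) - cosine_similarity_matrix(student_feats)
    return float((diff ** 2).mean())


def distillation_loss(teacher_probs, teacher_feats, student_logits, student_feats, alpha=1.0):
    # Hypothetical combination: label term plus alpha times the similarity term.
    return (label_alignment_loss(teacher_probs, student_logits)
            + alpha * similarity_alignment_loss(teacher_feats, student_feats))
```

Because the similarity term compares batch-by-batch matrices rather than raw vectors, this formulation lets the LLM and the small model use representations of different dimensionality, which is one reason a structure-level loss is convenient for distilling across model sizes.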
