Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective
You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, Xuejie Zhang
TL;DR
This work tackles the challenge of non-independent and identically distributed (non-IID) data in text understanding by rethinking the adaptation of pre-trained language models (PLMs) through a Bayesian lens. It introduces M2A, a multi-attribute, multi-grained adaptation framework that ensembles IID and non-IID information using lightweight LoRA-based modules and a joint learning objective that combines multitask optimization with distillation. A KronA-based decomposition further improves parameter efficiency for fine-grained views, while a Bayesian training scheme leverages both $p(\mathcal{D}_y|\mathcal{D}_x; w)$ and $p(\mathcal{D}_x|w)$ to model data heterogeneity. Empirical results on multi-domain sentiment and personalized sentiment datasets show that M2A consistently outperforms strong baselines, especially as data heterogeneity grows and PLMs scale up. The work suggests broad applicability for robust PLM adaptation and points to future directions in richer multi-view data and automated heterogeneity detection.
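As a rough illustration of why a Kronecker-style decomposition can be lighter than a low-rank one, the sketch below compares the two generic update forms: a LoRA-style update $\Delta W = BA$ and a Kronecker-product update $\Delta W = A \otimes B$. This is a minimal sketch, not the authors' implementation. The hidden size (768), the rank (8), the factor shapes, and the helper names `lora_delta` and `kron_delta` are all illustrative assumptions, and how M2A assigns such modules to attributes and granularities is not specified here.

```python
import numpy as np

def lora_delta(d_out, d_in, rank, rng):
    """Low-rank update: delta_W = B @ A, with B (d_out x rank) and A (rank x d_in)."""
    B = np.zeros((d_out, rank))        # zero init, so the update starts at 0
    A = rng.normal(size=(rank, d_in))
    return B @ A, B.size + A.size

def kron_delta(a_shape, b_shape, rng):
    """Kronecker update: delta_W = kron(A, B); output shape is the product of factor shapes."""
    A = np.zeros(a_shape)              # zero init, so the update starts at 0
    B = rng.normal(size=b_shape)
    return np.kron(A, B), A.size + B.size

rng = np.random.default_rng(0)
d = 768  # assumed hidden size, for illustration only

lora_dw, lora_params = lora_delta(d, d, rank=8, rng=rng)
kron_dw, kron_params = kron_delta((32, 32), (24, 24), rng=rng)  # 32 * 24 = 768

print(lora_dw.shape, lora_params)  # (768, 768) 12288
print(kron_dw.shape, kron_params)  # (768, 768) 1600
```

Both updates have the same shape as the frozen weight matrix, but under these assumed sizes the Kronecker factors need far fewer trainable parameters. That saving matters most when a separate module is kept for each fine-grained view.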
Abstract
Current neural networks often employ multi-domain learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remain unclear. This study revisits, from a Bayesian perspective, the assumption that non-IID information helps PLMs achieve performance improvements; this perspective uncovers and integrates non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptation (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A on widely used text-understanding datasets and demonstrate its superior performance, particularly when data are implicitly non-IID and PLMs scale up.
