IMO: Greedy Layer-Wise Sparse Representation Learning for Out-of-Distribution Text Classification with Pre-trained Models

Tao Feng, Lizhen Qu, Zhuang Li, Haolan Zhan, Yuncheng Hua, Gholamreza Haffari

TL;DR

This paper tackles single-source domain generalization for text classification by learning invariant representations from pre-trained transformers. It introduces IMO, a greedy layer-wise approach that learns sparse, domain-invariant feature masks and couples them with token-level attention to focus on predictive tokens. A theoretical analysis links invariant representations to causal features, and experiments show that IMO outperforms strong baselines, including several open LLMs, across sentiment and topic/social-factor tasks. IMO also remains resilient when training data is scarce, and ablations and feature analyses give insight into its behavior. The work offers a practical route to robust OOD text classification with pre-trained encoders and highlights the role of top-down sparse representations and attention in mitigating spurious correlations.

Abstract

Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers that remove features irrelevant to prediction, so that the remaining features stay invariant. Additionally, IMO has an attention module at the token level to focus on tokens that are useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines in terms of various evaluation metrics and settings.
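
The two ingredients named in the abstract, a sparse feature mask and token-level attention, can be illustrated with a small toy sketch. This is not the authors' implementation: the mask and attention scores below are hand-set rather than learned, and the names `apply_mask` and `attention_pool` are hypothetical helpers introduced only for illustration.

```python
import math


def apply_mask(features, mask):
    """Zero out feature dimensions where the (hypothetical) mask is 0."""
    return [f * m for f, m in zip(features, mask)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, token_scores):
    """Weight each token vector by its softmax score and sum them."""
    weights = softmax(token_scores)
    dim = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three tokens with four feature dimensions each (made-up values).
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.0, 1.5, 2.0],
    [2.0, 1.0, 0.0, 1.0],
]
mask = [1, 0, 1, 0]        # assumption: in IMO the mask is learned; here it is fixed
scores = [2.0, 0.0, 0.0]   # assumption: stands in for learned attention scores

masked = [apply_mask(t, mask) for t in tokens]
pooled, weights = attention_pool(masked, scores)
print("attention weights:", weights)
print("pooled sentence vector:", pooled)
```

In this sketch the masked dimensions contribute nothing to the pooled vector, and the token with the highest score dominates the weighted sum. A classifier built on the pooled vector would therefore only see the retained features of the emphasized tokens.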

Paper Structure (28 sections, 1 theorem, 5 equations, 4 figures, 11 tables)

Key Result

Corollary 1

If there is no edge between $Y$ and $H_k$ in a causal graph, then $\mathcal{L}_{\Omega}(Y, H_i, \ldots, H_j) < \mathcal{L}_{\Omega}(Y, H_i, \ldots, H_j, H_k)$.

Figures (4)

  • Figure 1: The overall architecture of our method IMO.
  • Figure 2: Illustration of potential causal graphs between the variables $H_i$, $H_j$ of two features (encoded from an input $X$) and a target variable $Y$.
  • Figure 3: Visualization of filtering and mask vectors in IMO-BART. The top figure visualizes the filtering vectors ${\bm{m}}$, while the bottom one visualizes the mask vectors ${\bm{q}}$. The x-axis signifies the dimensionality of mask layers, whereas the y-axis denotes values attributed to each dimension.
  • Figure 4: Visualization of attention weights on tokens in Yelp dataset reviews.

Theorems & Definitions (1)

  • Corollary 1