Effect of Domain Generalization Techniques in Low Resource Systems

Mahi Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye

TL;DR

This study tackles domain generalization under distribution shift in low-resource NLP by evaluating two causal DG strategies. The first is data-level Counterfactual Data Augmentation (CDA), which uses paraphrase generation to enforce $P(y \mid x') = P(y \mid x)$, where $x'$ is a paraphrase of the input $x$. The second is representation-level Invariant Causal Representation Learning (ICRL) via the DINER framework, adapted to a multilingual African ABSA benchmark (Afri-SemEval). The study introduces a unified experimental setup that combines paraphrase-based augmentation on NaijaSenti Yoruba and Igbo with causal representation learning across 17 African languages, enabling cross-domain and cross-language evaluation. Results show that CDA yields consistent cross-domain gains in sentiment classification (notably for Yoruba), while DINER-based ICRL accelerates convergence and improves in-domain and some out-of-distribution (OOD) performance. The ICRL gains vary by language and translation quality, because cross-lingual transfer introduces a dual shift. These findings demonstrate the practical viability of causal DG in multilingual, low-resource settings and highlight the need for robust alignment and generation controls to mitigate translation-induced distribution shifts.

Abstract

Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.
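
The paraphrase-based augmentation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function `augment_with_paraphrases`, the rule-based `toy_paraphrase`, the `SWAPS` table, and the English toy sentences are all hypothetical stand-ins chosen for illustration, and a real setup would presumably use a learned paraphrase generator on Yoruba or Igbo text. The sketch only shows the core idea that each paraphrase inherits the label of its source sentence.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, sentiment label)


def augment_with_paraphrases(
    data: List[Example],
    paraphrase: Callable[[str], List[str]],
) -> List[Example]:
    """Return the original examples followed by label-preserving paraphrases.

    Every paraphrase x' of a text x is assigned the label of x, which is the
    assumption that the paraphrase leaves the sentiment unchanged.
    """
    augmented = list(data)
    seen = {text for text, _ in data}
    for text, label in data:
        for variant in paraphrase(text):
            # Skip variants that duplicate an existing training text.
            if variant not in seen:
                seen.add(variant)
                augmented.append((variant, label))
    return augmented


# Hypothetical stand-in for a paraphrase model: simple word substitution.
SWAPS = {"film": "movie", "great": "excellent"}


def toy_paraphrase(text: str) -> List[str]:
    candidate = " ".join(SWAPS.get(word, word) for word in text.split())
    # Return no paraphrase when no word could be substituted.
    return [candidate] if candidate != text else []


train = [
    ("the film was great", "positive"),
    ("the plot was dull", "negative"),
]
augmented = augment_with_paraphrases(train, toy_paraphrase)
for text, label in augmented:
    print(label, "|", text)
```

In this toy run only the first sentence contains substitutable words, so the augmented set holds the two original examples plus one paraphrase carrying the "positive" label. Whether a generated paraphrase truly preserves sentiment is an assumption that would need checking in practice, for example by filtering generated text before training.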

Paper Structure

This paper contains 25 sections, 11 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1:
  • Figure 2: Causal structure of the DINER framework (wu2024diner).
  • Figure 3: Comparison of accuracy charts. (a) XLM-R, (b) Afro-XLMR-large, (c) Afro-XLMR-large-76L.