AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness

Miaoran Zhang, Mingyang Wang, Jesujoba O. Alabi, Dietrich Klakow

TL;DR

This work tackles Semantic Textual Relatedness (STR) across the 14 languages of SemEval-2024 Task 1, most of them under-resourced, by introducing AAdaM, a cross-encoder system that combines machine-translation-based data augmentation, task-adaptive pre-training (TAPT), and adapter-based tuning via MAD-X. Data augmentation uses machine translations of the English SemRel and STS-B data to enrich the training signal, while TAPT adapts the backbone to the STR task. The system uses AfroXLMR-large-61L as its backbone and is evaluated in both the supervised (Subtask A) and cross-lingual transfer (Subtask C) settings. It ranks first on average in both subtasks, with its strongest per-language showings in Spanish (Subtask A) and in Indonesian and Punjabi (Subtask C), and cross-lingual transfer benefits from careful source-language selection. These results highlight the effectiveness of combining MT-based augmentation, TAPT, and modular adapters for multilingual STR in low-resource languages. They also point to future work on reducing the dependency on development data for source selection and on refining cross-lingual transfer dynamics.

Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

Paper Structure (23 sections, 4 figures, 5 tables)

This paper contains 23 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: SemRel data distribution across languages.
  • Figure 2: Subtask C performance on development sets (Spearman's correlation $\times 100$) using different types of language adapters. Boxes highlight the optimal performances for each target language, and we select the best source for final submission.
  • Figure 3: Performance on test sets (Spearman's correlation $\times 100$) in different relatedness levels.
  • Figure 4: Comparison of different source language selection methods.
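The figure captions report Spearman's correlation $\times 100$, and Figure 2 notes that the best source language is chosen per target from development-set performance. The following is a minimal sketch of both steps, not the authors' evaluation code: the function names (`average_ranks`, `spearman_x100`, `best_source`) are hypothetical, and tied scores are assumed to receive average ranks.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_x100(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Spearman's rank correlation (Pearson on ranks), scaled by 100."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rg, rp = average_ranks(gold), average_ranks(pred)
    mg, mp = sum(rg) / len(rg), sum(rp) / len(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return 100.0 * cov / (var_g * var_p) ** 0.5


def best_source(dev_scores: dict[str, float]) -> str:
    """Pick the source language with the highest development-set score."""
    return max(dev_scores, key=dev_scores.get)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    gold = [0.10, 0.50, 0.90, 0.30]
    pred = [0.20, 0.40, 0.80, 0.35]
    print(spearman_x100(gold, pred))
    print(best_source({"src_a": 71.2, "src_b": 68.5}))
```

Because the metric depends only on ranks, a model can score well without matching the gold relatedness values themselves, as long as it orders the sentence pairs correctly.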