UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

Shubhashis Roy Dipta, Sai Vallurupalli

TL;DR

SemEval-2024 Task 1 targets semantic textual relatedness across 14 low-resource languages under supervised and cross-lingual settings. The authors propose two unified STR models, TranSem (Siamese encoder with mean pooling) and FineSem (T5-based fine-tuning), and probe the impact of machine translation-based data augmentation using NLLB models. They show that direct fine-tuning on STR data can rival embedding-based approaches and that translating data to English yields improvements for several languages, with Track C results achieving top placements for Afrikaans and Indonesian. The work advances multilingual STR by delivering unified architectures, MT-augmentation insights, and publicly available code to spur further research in low-resource languages.

Abstract

The aim of SemEval-2024 Task 1, "Semantic Textual Relatedness for African and Asian Languages", is to develop models for identifying semantic textual relatedness (STR) between two sentences using multiple languages (14 African and Asian languages) and settings (supervised, unsupervised, and cross-lingual). Large language models (LLMs) have shown impressive performance on several natural language understanding tasks such as multilingual machine translation (MMT), semantic similarity (STS), and encoding sentence embeddings. Using a combination of LLMs that perform well on these tasks, we developed two STR models, $\textit{TranSem}$ and $\textit{FineSem}$, for the supervised and cross-lingual settings. We explore the effectiveness of several training methods and the usefulness of machine translation. We find that direct fine-tuning on the task is comparable to using sentence embeddings, and that translating to English leads to better performance for some languages. In the supervised setting, our model performance is better than the official baseline for 3 languages, with the remaining 4 performing on par. In the cross-lingual setting, our model performance is better than the baseline for 3 languages (leading to $1^{st}$ place for Afrikaans and $2^{nd}$ place for Indonesian), is on par for 2 languages, and is poor on the remaining 7 languages. Our code is publicly available at https://github.com/dipta007/SemEval24-Task8.

Paper Structure (22 sections, 1 equation, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Overview of TranSem model architecture (inspired by Reimers and Gurevych, 2019). The encoder ($\theta$) is shared, and the diamond box represents the loss function. The encoded sentence pairs ($x_1, x_2$) and the label ($y$) are the input to the cosine similarity loss.
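
The pipeline in the Figure 1 caption (shared encoder, mean pooling, cosine similarity loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the token vectors are toy values standing in for encoder outputs, the function names are hypothetical, and the loss is assumed to be the squared error between the cosine similarity of the two pooled embeddings and the label, which is one common form of cosine similarity loss.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(x1, x2, y):
    """Assumed loss form: squared error between cos(x1, x2) and the label y."""
    return (cosine_similarity(x1, x2) - y) ** 2


# Toy token vectors standing in for the shared encoder's output (assumption).
tokens_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is padding
mask_a = [1, 1, 0]
tokens_b = [[1.0, 1.0], [3.0, 3.0]]
mask_b = [1, 1]

x1 = mean_pool(tokens_a, mask_a)  # sentence embedding for sentence 1
x2 = mean_pool(tokens_b, mask_b)  # sentence embedding for sentence 2
loss = cosine_similarity_loss(x1, x2, y=1.0)
```

Because both sentences pass through the same pooling function (and, in a real model, the same encoder weights), the two embeddings live in one space, so their cosine can be compared directly against a relatedness label.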