IITK at SemEval-2024 Task 1: Contrastive Learning and Autoencoders for Semantic Textual Relatedness in Multilingual Texts
Udvas Basak, Rajarshi Dutta, Shivam Pandey, Ashutosh Modi
TL;DR
This work tackles semantic textual relatedness across 14 languages, including low-resource African and Asian languages. It combines supervised contrastive learning (SimCSE-based) with Transformer-based Sequential Denoising Auto-Encoders (TSDAE) and develops a 42-feature composite relatedness metric, evaluated on Distil-RoBERTa and multilingual variants. Empirical results show that the supervised contrastive approach underperforms on several languages, while the unsupervised TSDAE pathway remains comparatively robust; the language-specific difficulties point to directions for future improvement. The study also introduces a bigram-based corpus strategy to refine embeddings, highlighting the need to tailor approaches to diverse linguistic typologies for robust cross-lingual semantic relatedness.
Abstract
This paper describes our system developed for SemEval-2024 Task 1: Semantic Textual Relatedness. The challenge focuses on automatically detecting the degree of relatedness between pairs of sentences in 14 languages, including both high- and low-resource Asian and African languages. Our team participated in two subtasks: Track A (supervised) and Track B (unsupervised). This paper focuses on a BERT-based contrastive learning and similarity-metric-based approach, primarily for the supervised track, while exploring autoencoders for the unsupervised track. It also aims to create a bigram relatedness corpus using a negative sampling strategy, thereby producing refined word embeddings.
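To make the contrastive objective mentioned above concrete, the following is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind used in SimCSE-like training. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the temperature value of 0.05 are all assumptions chosen for illustration, and a real system would compute embeddings with a BERT-style encoder rather than hard-coding them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss over a batch (hypothetical helper).

    anchors[i] is expected to match positives[i]; every other entry of
    `positives` acts as an in-batch negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Temperature-scaled similarities of this anchor to all candidates.
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (assumed values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchors

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
```

In this toy setting the loss is near zero when each anchor is closest to its own positive and large when the pairs are swapped, which is the behaviour a contrastive objective rewards during training.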
