IITK at SemEval-2024 Task 1: Contrastive Learning and Autoencoders for Semantic Textual Relatedness in Multilingual Texts

Udvas Basak, Rajarshi Dutta, Shivam Pandey, Ashutosh Modi

TL;DR

This work tackles semantic textual relatedness across 14 languages, including low-resource African and Asian languages. It combines supervised contrastive learning (SimCSE-based) with transformer-based denoising autoencoders (TSDAE) and develops a 42-feature composite relatedness metric, evaluated on Distil-RoBERTa and multilingual variants. Empirical results show that the supervised contrastive approach underperforms on several languages, while the unsupervised TSDAE pathway remains comparatively robust; the language-specific challenges that surface motivate future improvements. The study also introduces a bigram-based corpus strategy to refine embeddings, highlighting the need to tailor approaches to diverse linguistic typologies for robust cross-lingual semantic relatedness.

Abstract

This paper describes our system developed for SemEval-2024 Task 1: Semantic Textual Relatedness. The task is to automatically detect the degree of relatedness between pairs of sentences in 14 languages, including both high- and low-resource Asian and African languages. Our team participated in two subtasks: Track A (supervised) and Track B (unsupervised). The paper focuses on a BERT-based contrastive learning approach combined with similarity metrics, primarily for the supervised track, while exploring autoencoders for the unsupervised track. It also aims to create a bigram relatedness corpus using a negative sampling strategy, thereby producing refined word embeddings.
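
To make the contrastive-learning idea concrete, the following is a minimal sketch of an InfoNCE-style objective with in-batch negatives, the kind of loss that SimCSE-style training uses. It is not the authors' implementation: the function names (`cosine`, `info_nce_loss`), the toy two-dimensional "embeddings", and the temperature of 0.05 are all assumptions chosen for illustration. In practice the vectors would come from a sentence encoder such as a BERT variant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    positives[i] is the positive for anchors[i]; every other entry of
    `positives` acts as an in-batch negative for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (assumed values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch, which is the signal that pulls related sentences together and pushes unrelated ones apart.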

Paper Structure (14 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: SimCSE-based approach for Track A
  • Figure 2: NGD Calculation flowchart
  • Figure 3: Covariance Matrix between all 42 metrics
  • Figure 4: Bigram Corpus Creation Flowchart