UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation

Thanet Markchom; Tong Wu; Liting Huang; Huizhi Liang

TL;DR

This work tackles multilingual multimodal idiomaticity representation for image ranking in SemEval-2025 Task 1 Subtask A. Generative LLMs produce idiomatic meanings for compounds, and multilingual CLIP models encode those meanings as embeddings. The system comprises an LLM-based idiom detector, an embedding pipeline that encodes either the literal compound or the generated idiomatic meaning, and an ensemble of LLMs to stabilize predictions. A contrastive learning stage with data augmentation further refines the embeddings, though fine-tuning yields mixed gains. Experimental results show that these multimodal representations outperform baselines that rely solely on the nominal compounds, and the best-performing configuration differs by language (e.g., GPT-4 plus XLM-R LABSE for English, GPT-3.5 with LABSE-14 for Portuguese). The findings highlight the potential of generative and multimodal signals for idiomaticity tasks and indicate that data-efficient fine-tuning remains a challenge. Code for reproducibility is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.

Abstract

SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound, in English or Brazilian Portuguese, that may carry idiomatic meaning. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, and the resulting embeddings serve as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperform those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
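
The ranking step described above can be illustrated with a minimal sketch. It assumes the text and image embeddings have already been produced by some multilingual CLIP model; the vectors below are toy placeholders, not model outputs. The function names `choose_text`, `cosine`, and `rank_images` are hypothetical and are not taken from the authors' repository, and the use of cosine similarity as the ranking score is an assumption made for illustration.

```python
import math
from typing import Dict, List, Sequence


def choose_text(compound: str, is_idiomatic: bool, generated_meaning: str) -> str:
    """Pick the text to encode: the LLM-generated meaning for idiomatic
    uses, otherwise the literal compound (hypothetical helper)."""
    return generated_meaning if is_idiomatic else compound


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rank_images(text_emb: Sequence[float],
                image_embs: Dict[str, Sequence[float]]) -> List[str]:
    """Return image names ordered from most to least similar to the text."""
    return sorted(image_embs,
                  key=lambda name: cosine(text_emb, image_embs[name]),
                  reverse=True)


# Toy placeholder embeddings (assumed, for illustration only).
text_embedding = [1.0, 0.0]
image_embeddings = {
    "img_a": [0.9, 0.1],
    "img_b": [0.0, 1.0],
    "img_c": [0.5, 0.5],
}

text_to_encode = choose_text("cold feet", True, "nervousness before an event")
ranking = rank_images(text_embedding, image_embeddings)
print(text_to_encode)
print(ranking)
```

In a real system, `text_embedding` would come from encoding `text_to_encode` with the CLIP text encoder and each image vector from the matching image encoder; only the selection and ranking logic is shown here.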
