TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval

Prasanna Devadiga, Arya Suneesh, Pawan Kumar Rajpoot, Bharatdeep Hazarika, Aditya U Baliga

TL;DR

This work tackles multilingual fact-check retrieval with a two-stage pipeline: an embedding-based retriever fine-tuned on translation-augmented data, followed by selective LLM-assisted reranking. Social-media posts are translated into English and a Stella 400M embedding model is fine-tuned on the augmented data, which yields robust monolingual performance; crosslingual gains come from translation and from learning across languages. Hard-negative mining significantly boosts retrieval quality, and although LLM-based reranking provides some gains, the largest improvements come from translation and embedding fine-tuning. The approach is practical on consumer-grade GPUs and offers a scalable path for multilingual misinformation auditing in real-world fact-checking workflows.
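The summary does not say how hard negatives were mined, so the following is only a minimal sketch of one common approach: for each post, pick the non-matching fact-checks that the current embeddings score as most similar. The function names, the toy two-dimensional vectors, and the claim ids are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(post_vec, claim_vecs, positive_ids, k=2):
    """Return ids of the k non-matching claims most similar to the post.

    Assumption: "hard" means "high similarity under the current embeddings
    but not a labeled match"; the paper may use a different criterion.
    """
    scored = [
        (cosine(post_vec, vec), claim_id)
        for claim_id, vec in claim_vecs.items()
        if claim_id not in positive_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [claim_id for _, claim_id in scored[:k]]


# Toy embeddings (hypothetical): c1 is the labeled match for the post.
claims = {
    "c1": [1.0, 0.0],
    "c2": [0.9, 0.1],
    "c3": [0.0, 1.0],
    "c4": [0.7, 0.7],
}
hard = mine_hard_negatives([1.0, 0.0], claims, positive_ids={"c1"}, k=2)
print(hard)
```

In this toy setup the mined negatives are the near-duplicates of the true match rather than the unrelated claim, which is the kind of example that pushes a fine-tuned retriever to separate closely related fact-checks.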

Abstract

We address the challenge of retrieving previously fact-checked claims in monolingual and crosslingual settings, a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system built on a fine-tuned embedding model, followed by an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved success@10 scores of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively.
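For readers unfamiliar with the metric, here is a minimal sketch of how success@10 could be computed, assuming it is read in the usual way: a post counts as a success if at least one of its matching fact-checks appears among the top 10 retrieved results, and the score is the fraction of posts that succeed. The function names and toy ids are illustrative and do not come from the paper or the task's official scorer.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant fact-check is in the top-k ranked ids, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in ranked_ids[:k]) else 0.0


def mean_success_at_k(runs, k=10):
    """Average success@k over (ranked_ids, relevant_ids) pairs, one per post."""
    if not runs:
        return 0.0
    return sum(success_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy example (hypothetical ids): two posts, only the first has a hit in its top 2.
runs = [
    (["f3", "f7", "f1"], {"f7"}),
    (["f2", "f5", "f9"], {"f9"}),
]
score = mean_success_at_k(runs, k=2)
print(score)
```

Under this reading, the metric rewards getting any correct fact-check into the shortlist and ignores its exact position within the top k.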

Paper Structure

This paper contains 14 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Data Preparation and Model Fine-tuning Pipeline.
  • Figure 2: Two-Stage Retrieval and Ranking Architecture.