Exploring Large Language Models for Relevance Judgments in Tetun
Gabriel de Jesus, Sérgio Nunes
TL;DR
This work investigates automated relevance judgments for a Cranfield-style test collection in Tetun, a low-resource language. Using few-shot prompts, the authors compare LLaMA3-70B against Claude3 Haiku and GPT-3.5 Turbo on 6,100 query–document pairs, reporting an inter-annotator agreement score of $0.2634$ for LLaMA3-70B and demonstrating feasibility in low-resource settings. Translation quality and multi-task understanding are also analyzed: the paid models produce better Tetun-to-English translations, while LLaMA3-70B achieves strong agreement with human annotators, consistent with prior studies on high-resource languages. The findings support the potential of openly available LLMs to assist relevance judgments in low-resource contexts, highlighting cost and efficiency trade-offs and outlining directions for broader evaluation across languages and prompts.
Abstract
The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically made by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly for low-resource languages. In our study, the models receive a series of query-document pairs in Tetun as input and are tasked with assigning a relevance score to each pair. These scores are then compared with those of human annotators to measure inter-annotator agreement. Our investigation reveals results that align closely with those reported in studies of high-resource languages.
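To make the comparison step concrete, the following is a minimal sketch of how agreement between human and LLM relevance scores could be computed. It assumes Cohen's kappa as the agreement statistic and a graded 0 to 3 relevance scale; neither is stated in the text above. The function name `cohen_kappa` and the sample labels are hypothetical and for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items.

    Assumption: the agreement statistic is Cohen's kappa; the text above
    only says "inter-annotator agreement".
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance scores for six query-document pairs
# (assumed scale: 0 = not relevant ... 3 = highly relevant).
human_scores = [0, 1, 2, 3, 1, 0]
llm_scores = [0, 1, 2, 2, 1, 1]

kappa = cohen_kappa(human_scores, llm_scores)
print(f"kappa = {kappa:.4f}")
```

In this sketch, each list holds one score per query-document pair in the same order, so position `i` in both lists refers to the same pair.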
