Exploring Large Language Models for Relevance Judgments in Tetun

Gabriel de Jesus; Sérgio Nunes

TL;DR

This work investigates automated relevance judgments for a Cranfield-style test collection in Tetun, a low-resource language. Using few-shot prompts, the authors compare LLaMA3-70B against Claude 3 Haiku and GPT-3.5 Turbo on 6,100 query–document pairs, reporting an inter-annotator agreement score of $0.2634$ for LLaMA3-70B and demonstrating feasibility in low-resource settings. Translation quality and multi-task understanding are also analyzed: the paid models (Claude 3 Haiku and GPT-3.5 Turbo) produce better Tetun-to-English translations, while LLaMA3-70B reaches agreement with human annotators comparable to that reported in prior studies of high-resource languages. The findings support the potential of openly available LLMs to assist relevance judgments in low-resource contexts, highlight cost and efficiency trade-offs, and outline directions for broader evaluation across languages and prompts.

Abstract

The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically conducted by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly for low-resource languages. In our study, the LLMs receive a series of query-document pairs in Tetun as input text and assign a relevance score to each pair. These scores are then compared with those from human annotators to measure inter-annotator agreement. Our results align closely with those reported in studies of high-resource languages.
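The comparison between LLM-assigned and human-assigned relevance scores can be illustrated with a short sketch. It assumes Cohen's kappa as the agreement statistic (the text above does not name the metric) and uses a hypothetical graded scale of 0 to 3 with made-up labels; none of the data below comes from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: the agreement statistic is Cohen's kappa; the source
    text only speaks of "inter-annotator agreement".
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # independently, given each annotator's own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical relevance scores for eight query-document pairs.
human_scores = [0, 1, 2, 3, 1, 0, 2, 1]
llm_scores = [0, 1, 1, 3, 1, 0, 2, 2]

print(round(cohen_kappa(human_scores, llm_scores), 3))  # 0.652
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a common choice when relevance labels are unevenly distributed. In practice, an established implementation such as scikit-learn's `cohen_kappa_score` would normally be preferred over a hand-written one.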

Paper Structure (13 sections, 11 tables)


Introduction
Related Work
Collection Overview
Documents
Queries
Relevance Judgments
Relevance Judgments Using LLMs
Overview
Experiment with Tetun
Results and Discussions
Conclusions and Future Work
Acknowledgment
System Prompt Details
