Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

TL;DR

Quati addresses the shortage of native Brazilian Portuguese IR datasets by introducing a semi-automated pipeline that combines a native corpus with manually authored queries and GPT-4-based relevance judgments. It builds two large-scale passage collections (10M and 1M) sourced from ClueWeb22 and evaluates a diverse set of retrieval systems, using LLM-derived qrels to compute $nDCG@10$ across 50 test queries. The study shows that GPT-4 annotations correlate with human judgments at a cost substantially lower than fully manual labeling, though performance varies by question, and that the dataset supports robust evaluation of both sparse and dense retrieval methods. Quati’s public release and reproducible pipeline offer a scalable approach to creating high-quality IR datasets for other languages, promoting broader, language-specific IR research and benchmarking.
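To make the evaluation metric concrete, the following is a minimal sketch of how $nDCG@10$ could be computed from graded qrels and averaged over the test queries. It assumes a linear gain with a log2 rank discount and treats unjudged passages as non-relevant; the paper's exact gain function and tooling are not stated in this section, and the names `ndcg_at_k`, `mean_ndcg`, `run`, and `qrels` are illustrative rather than taken from the Quati scripts.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking, query_qrels, k=10):
    """nDCG@k for one query.

    ranking: passage ids ordered by the retriever, best first.
    query_qrels: dict mapping passage id -> graded relevance score.
    Unjudged passages are treated as non-relevant (an assumption).
    """
    gains = [query_qrels.get(pid, 0) for pid in ranking[:k]]
    ideal_gains = sorted(query_qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mean_ndcg(run, qrels, k=10):
    """Average nDCG@k over every query that has relevance judgments."""
    scores = [ndcg_at_k(run.get(qid, []), qrels[qid], k) for qid in qrels]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries with graded labels.
qrels = {
    "q1": {"p1": 3, "p2": 0, "p3": 1},
    "q2": {"p4": 2, "p5": 1},
}
run = {
    "q1": ["p1", "p3", "p2"],  # ideal ordering
    "q2": ["p5", "p4"],        # the two relevant passages are swapped
}
print(mean_ndcg(run, qrels))
```

In practice an established evaluation tool would normally be preferred over a hand-rolled metric; the sketch is only meant to show what the reported number measures.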

Abstract

Despite Portuguese being one of the most widely spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, which makes the corpus more representative and relevant. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati .

Paper Structure

This paper contains 18 sections, 2 figures, and 15 tables.

Figures (2)

  • Figure 1: Proposed IR dataset creation methodology.
  • Figure 2: When analyzed per query, humans and GPT-4 mostly find the same questions harder to annotate for passage relevance, as the Cohen's Kappa agreement indicates. For 3 questions (IDs 9, 98 and 154), GPT-4's annotations agreed strongly with those of Human Annotator 2, which explains the higher metrics for those questions.
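
The per-query agreement described in the Figure 2 caption can be illustrated with a small sketch of Cohen's Kappa between two annotators. It assumes the unweighted form of the statistic; whether the paper uses a weighted variant is not stated in this section. The function name and the relevance labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's Kappa between two annotators' labels for the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance scores for the passages of a single query.
human = [0, 1, 2, 3, 1, 0]
gpt4 = [0, 1, 2, 2, 1, 0]
print(cohen_kappa(human, gpt4))
```

Computing this value separately for each query, once for human versus human and once for human versus GPT-4, gives the kind of per-question comparison the figure reports.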