CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

TL;DR

CUS-QA is a multilingual, multimodal benchmark for open-ended regional question answering in Czech, Slovak, and Ukrainian, grounded in local Wikipedia content. It combines textual and visual QA with manual human evaluation, automatic metrics, and retrieval-augmented generation (RAG) to study knowledge grounding. The study reveals substantial gaps in the regional knowledge of current LLMs (accuracy only just over 40% on textual questions and below 30% on visual questions) and shows strong correlations between automatic metrics and human judgments, especially for textual tasks. The dataset supports cross-lingual analyses, prompt-robustness checks, and RAG evaluations, offering a practical platform for probing localization, cultural biases, and the trade-off between retrieval and knowledge storage in large language models.

Abstract

We introduce CUS-QA, a benchmark for evaluating open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only just over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong correlation with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.
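
As an illustration of what validating a string-overlap metric against human judgments can look like, the following minimal sketch computes a token-level F1 score between a model answer and a reference answer, then correlates the scores with binary human correctness labels. It is not the paper's evaluation code: the function names `token_f1` and `pearson`, the whitespace tokenization, and the toy examples are all assumptions made for illustration, and the toy data are invented rather than drawn from CUS-QA.

```python
from collections import Counter
import math


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens (an assumed, simple overlap metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Invented toy triples: (model answer, reference answer, human label, 1 = correct).
toy_examples = [
    ("Prague", "Prague", 1),
    ("the city of Prague", "Prague", 1),
    ("Brno", "Prague", 0),
    ("I do not know", "Prague", 0),
]

scores = [token_f1(pred, ref) for pred, ref, _ in toy_examples]
labels = [float(label) for _, _, label in toy_examples]
correlation = pearson(scores, labels)
```

When the reference answer is a short named entity, a correct answer tends to contain that entity verbatim, so even a simple overlap score separates correct from incorrect answers reasonably well, which is consistent with the abstract's observation about string-overlap metrics.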

Paper Structure

This paper contains 31 sections, 29 tables.