UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov

TL;DR

This work addresses the paucity of factuality evaluation for Urdu by introducing UrduFactBench and UrduFactQA as benchmark datasets for claim verification and LLM factuality in Urdu QA, respectively, and UrduFactCheck as a modular, end-to-end fact-checking framework. The framework employs a three-strategy evidence retrieval pipeline—Monolingual Retrieval, Translated Retrieval, and Thresholded Translated Retrieval—with a threshold parameter $\tau$ to balance accuracy and cost, demonstrated through extensive experiments across twelve LLMs. Experiments show translation-augmented pipelines improve factuality over monolingual approaches, with GPT-4o-family models achieving the best performance while smaller models offer favorable cost-performance trade-offs. The authors provide thorough analyses of dataset construction, translation quality, latency, and error types, and publicly release the resources to enable reproducible research and extensions to other low-resource languages. Overall, UrduFactBench, UrduFactQA, and UrduFactCheck establish a foundation for systematic Urdu factuality research and practical tooling to curb misinformation in Urdu-language contexts.
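The thresholded strategy can be sketched as a simple control flow: retrieve Urdu evidence first, and reroute through translation only when fewer than $\tau$ items come back. The sketch below is illustrative and is not the framework's actual API. The function name, the callable parameters `retrieve_urdu`, `retrieve_english`, and `translate_ur_to_en`, and the choice to merge the two evidence lists rather than replace one with the other are all assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical type aliases: a retriever maps a query to evidence snippets,
# and a translator maps Urdu text to English text.
Retriever = Callable[[str], List[str]]
Translator = Callable[[str], str]


def thresholded_translated_retrieval(
    claim_ur: str,
    retrieve_urdu: Retriever,
    retrieve_english: Retriever,
    translate_ur_to_en: Translator,
    tau: int,
) -> List[str]:
    """Return evidence for an Urdu claim, rerouting via English when evidence is scarce."""
    # Step 1: monolingual retrieval with the original Urdu claim.
    evidence = retrieve_urdu(claim_ur)

    # Step 2: if at least tau items were found, skip the costlier translated path.
    if len(evidence) >= tau:
        return evidence

    # Step 3: otherwise translate the claim and retrieve English evidence.
    # Merging both lists is an assumption of this sketch.
    claim_en = translate_ur_to_en(claim_ur)
    return evidence + retrieve_english(claim_en)


if __name__ == "__main__":
    # Stub components standing in for real search and translation services.
    urdu_stub: Retriever = lambda q: ["urdu snippet"]
    english_stub: Retriever = lambda q: ["english snippet A", "english snippet B"]
    translate_stub: Translator = lambda q: "EN: " + q

    for tau in (1, 2):
        found = thresholded_translated_retrieval(
            "claim", urdu_stub, english_stub, translate_stub, tau
        )
        print(tau, found)
```

Under these assumptions, setting `tau` to 0 reduces to Monolingual Retrieval, while a very large `tau` always triggers the translated path, which is why the threshold acts as the knob between cost and accuracy.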

Abstract

The rapid adoption of Large Language Models (LLMs) has raised important concerns about the factual reliability of their outputs, particularly in low-resource languages such as Urdu. Existing automated fact-checking systems are predominantly developed for English, leaving a significant gap for the more than 200 million Urdu speakers worldwide. In this work, we present UrduFactBench and UrduFactQA, two novel hand-annotated benchmarks designed to enable fact-checking and factual consistency evaluation in Urdu. While UrduFactBench focuses on claim verification, UrduFactQA targets the factuality of LLMs in question answering. These resources, the first of their kind for Urdu, were developed through a multi-stage annotation process involving native Urdu speakers. To complement these benchmarks, we introduce UrduFactCheck, a modular fact-checking framework that incorporates both monolingual and translation-based evidence retrieval strategies to mitigate the scarcity of high-quality Urdu evidence. Leveraging these resources, we conduct an extensive evaluation of twelve LLMs and demonstrate that translation-augmented pipelines consistently enhance performance compared to monolingual ones. Our findings reveal persistent challenges for open-source LLMs in Urdu and underscore the importance of developing targeted resources. All code and data are publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.

Paper Structure

This paper contains 44 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Three core fact-checking pipelines of UrduFactCheck: (top) end-to-end Urdu framework, (middle) translation-augmented retrieval, and (bottom) threshold-based rerouting, triggered when the number of retrieved evidence items is below $\tau$. Green indicates Urdu context, while blue indicates English context.
  • Figure 2: Effect of evidence threshold $\tau$ on fact-checking performance and cost for the Factcheck-Bench subset of UrduFactBench. The blue line (left axis) shows the F1 score, while the red line (right axis) shows the total cost in USD. Higher thresholds increase cost but can improve F1 up to an optimal range before plateauing.
  • Figure 3: Automatic factuality evaluation results for 12 SOTA LLMs on UrduFactQA using UrduFactCheck-TR. Left: the percentage of true claims. Center: the number of false claims. Right: the cost of using UrduFactCheck-TR in USD.
  • Figure 4: Error distribution across datasets, grouped into nine fine-grained types under four major issues. "I" represents the base UrduFactCheck pipeline, "II" refers to the version with translation-augmented retrieval, and "III" denotes the version with threshold-based rerouting.
  • Figure 5: UrduFactCheck annotator dashboard built in Streamlit.
  • ...and 7 more figures