Investigating Hallucination in Conversations for Low Resource Languages

Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha

TL;DR

This study evaluates hallucination in multilingual conversations across Hindi, Farsi, and Mandarin using six LLMs on two translated dialogue datasets (BlendedSkillTalk and DailyDialog). It quantifies factual and semantic fidelity with ROUGE-L, FactCC, and NLI, revealing language-dependent patterns: Mandarin shows minimal lexical overlap yet strong factual consistency, while Hindi and Farsi exhibit more semantic-level hallucinations, especially for smaller open-source models. The findings underscore the critical role of language-resource availability and dataset style, informing mitigation strategies such as retrieval-augmented generation, grounded decoding, and language-aware pretraining. Overall, the work highlights the need for broad, language-inclusive evaluations to guide reliable and equitable conversational AI development.
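To make the lexical-overlap metric concrete, here is a minimal sketch of ROUGE-L as it is commonly defined: an F-measure over the longest common subsequence (LCS) of reference and candidate tokens. It is illustrative only and is not the paper's evaluation code. The function names, the whitespace tokenization, and the default `beta` are assumptions; Mandarin text in particular would need character-level or segmenter-based tokenization instead of `split()`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming, O(len(a) * len(b)) time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between two strings (hypothetical helper).

    Assumption: tokens are whitespace-separated; swap in a proper
    tokenizer for languages written without spaces.
    """
    ref = reference.split()
    cand = candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Because the score only rewards shared token order, a response can score low on ROUGE-L while remaining factually consistent with the reference, which is why the study pairs it with FactCC and NLI.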

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

Paper Structure

This paper contains 16 sections, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: Workflow diagram of our work. It shows a sample conversation in which an LLM provides irrelevant responses in Hindi, Farsi, and Mandarin. The left side shows the inputs to the LLM and the right side shows the LLM's irrelevant responses. We explore GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3 in this paper.
  • Figure 2: Hallucination (ROUGE-L, FactCC, NLI) scores across the six LLMs for Hindi, Farsi, and Mandarin on the BlendedSkillTalk dataset. Across all LLMs, Farsi shows the most hallucination and Mandarin the least.
  • Figure 3: Hallucination (ROUGE-L, FactCC, NLI) scores across the six LLMs for Hindi, Farsi, and Mandarin on the BlendedSkillTalk dataset. Across all LLMs, Farsi shows the most hallucination and Mandarin the least.