Safer in Translation? Presupposition Robustness in Indic Languages

Aadi Palnitkar, Arjun Suresh, Rishi Rajesh, Puneet Puli

TL;DR

This work evaluates LLM healthcare guidance in Indic languages by extending the Cancer-Myth benchmark to Cancer-Myth-Indic, translating a 500-item subset into five languages. The translations preserve implicit presuppositions, and fixed scoring (PCS and PCR) is used to compare GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o under presupposition stress. Key findings show substantial language-conditioned safety gaps: Dravidian languages are particularly challenging for weaker models, while GPT-4o demonstrates robust presupposition correction across all languages. The study provides deployment guidelines and reproducibility resources, highlights morphology-driven safety differences, and urges language-aware strategies for health AI systems.

Abstract

A growing number of people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of LLM responses to such queries. While existing medical benchmarks seek to accomplish this task, they are almost universally in English, which has left a notable gap in the literature on multilingual LLM evaluation. In this work, we help address this gap with Cancer-Myth-Indic, an Indic-language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages from the subcontinent (500 per language; 2,500 translated items total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress.

Paper Structure

This paper contains 27 sections, 2 tables.