ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages

Neha Joshi, Pamir Gogoi, Aasim Mirza, Aayush Jansari, Aditya Yadavalli, Ayushi Pandey, Arunima Shukla, Deepthi Sudharsan, Kalika Bali, Vivek Seshadri

TL;DR

ELR-1000 introduces a community-generated, multimodal recipe dataset spanning 10 endangered Eastern Indic languages to advance culturally grounded NLP. The work documents a careful data collection pipeline, including a pilot study, capacity-building, and a flexible multimodal schema, paired with a rigorous evaluation of six LLMs under contextual and non-contextual prompts. Findings show that context significantly improves translation quality, but cultural misalignments and fluent yet factually incorrect translations persist, highlighting the need for human-in-the-loop evaluation and culturally aware benchmarks. The paper contributes both the ELR-1000 resource and an evaluative framework that emphasizes ethical data collection, representation of indigenous knowledge, and equitable development of language technologies for underrepresented communities.

Abstract

We present a culturally grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their capabilities, they struggle with low-resource, culturally specific language. However, we observe that providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

Paper Structure

This paper contains 50 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1:
  • Figure 2: Impact of Contextual Information on Model Performance