Towards Open-Ended Discovery for Low-Resource NLP

Bonaventure F. P. Dossou, Henri Aïdasso

TL;DR

The paper addresses the data scarcity and infrastructural barriers hindering NLP for low-resource languages by proposing open-ended language discovery driven by joint human–model uncertainty. It introduces a three-component framework—modeling interactional uncertainty, language acquisition via human feedback, and continual learning from dialogic exposure—underpinned by a composite uncertainty $U_{total} = \alpha U_{human} + (1-\alpha) U_{model}$ and an information-gain–driven query policy $Q^*$. Feedback is integrated through a reliability-weighted target $\tilde{y}$ with $w_f = 1 - U_{human}$, and interactions are stored in a memory bank with weights $w_i = (1-U_{human}^{(i)})(1-U_{model}^{(i)})$ to enable uncertainty-aware updates. If realized, this approach could enable scalable, participatory NLP for under-documented languages, reducing reliance on large corpora and centralized infrastructure while empowering communities to shape AI tools that document and preserve linguistic diversity.
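The three weighting formulas quoted above can be illustrated with a small numeric sketch. This is only an illustration, not the authors' implementation: the function names are hypothetical, both uncertainties are assumed to be already normalized to $[0, 1]$, and $\alpha$ is assumed to be a mixing coefficient in $[0, 1]$. The query policy $Q^*$ and the exact form of $\tilde{y}$ are not spelled out in this summary, so they are left out.

```python
def _check_unit(name: str, value: float) -> None:
    # Assumption: every uncertainty and alpha lives in [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")


def total_uncertainty(u_human: float, u_model: float, alpha: float = 0.5) -> float:
    """U_total = alpha * U_human + (1 - alpha) * U_model."""
    for name, value in (("u_human", u_human), ("u_model", u_model), ("alpha", alpha)):
        _check_unit(name, value)
    return alpha * u_human + (1.0 - alpha) * u_model


def feedback_weight(u_human: float) -> float:
    """w_f = 1 - U_human: confident speakers get more weight."""
    _check_unit("u_human", u_human)
    return 1.0 - u_human


def memory_weight(u_human: float, u_model: float) -> float:
    """w_i = (1 - U_human) * (1 - U_model) for one stored interaction."""
    _check_unit("u_human", u_human)
    _check_unit("u_model", u_model)
    return (1.0 - u_human) * (1.0 - u_model)


if __name__ == "__main__":
    # Hypothetical interaction: a fairly confident speaker, an unsure model.
    u_h, u_m = 0.2, 0.6
    print("U_total:", total_uncertainty(u_h, u_m, alpha=0.5))
    print("w_f:", feedback_weight(u_h))
    print("w_i:", memory_weight(u_h, u_m))
```

Under these assumptions, an interaction is weighted highly in memory only when both the speaker and the model are confident, because the product form of $w_i$ goes to zero whenever either uncertainty approaches 1.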

Abstract

Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world's linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.

Paper Structure

This paper contains 18 sections, 13 equations, 1 figure.

Figures (1)

  • Figure 1: Illustration of the proposed approach for open-ended learning of low-resource languages. It shows a voice conversation in which a human teaches an agent to recognize and respond to requests for the capital city of a country in the Fon language.