Task-Oriented Dialog Systems for the Senegalese Wolof Language

Derguene Mbaye, Moussa Diallo

TL;DR

Faced with the scarcity of NLP resources for Wolof, the paper builds a modular task-oriented dialog system (ToDS) by combining a Rasa-based chatbot generation engine with cross-lingual annotation projection from French via an in-house machine translation (MT) system. The core contribution is a three-step annotation projection pipeline (parsing, translation, backfilling) that replaces annotated spans with identifiers before translation so that the labels are preserved, producing synthetic Wolof training data. A Wolof intent classifier trained on this synthetic data reaches a macro F1 close to that of its French counterpart, which suggests that language-agnostic pipelines are viable for low-resource ToDS. The approach is designed to extend to other low-resource languages, and the results highlight the importance of MT quality and targeted data augmentation.

Abstract

Recent years have seen considerable interest in conversational agents, driven by the rise of large language models (LLMs). Although they offer considerable advantages, LLMs also present significant risks, such as hallucination, which hinder their widespread deployment in industry. Moreover, low-resource languages such as African languages are still underrepresented in these systems, which limits their performance in these languages. In this paper, we illustrate a more classical approach based on modular architectures of Task-oriented Dialog Systems (ToDS), which offer better control over outputs. We propose a chatbot generation engine based on the Rasa framework and a robust methodology for projecting annotations onto the Wolof language using an in-house machine translation system. When we evaluate a generated chatbot trained on the Amazon MASSIVE dataset, its Wolof intent classifier performs similarly to the one obtained for French, a resource-rich language. We also show that this approach is extensible to other low-resource languages, thanks to the intent classifier's language-agnostic pipeline, which simplifies the design of chatbots in these languages.

Paper Structure

This paper contains 11 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: The three-step annotation projection algorithm: parsing, which replaces the source annotations with IDs; translation of the parsed sentence; and backfilling of the translated annotations.
  • Figure 2: Diagram of the chatbot engine's processing of Excel files to output Rasa projects. Each Excel file constitutes a domain containing several sheets corresponding to intents, and each sheet contains the intent's example data.
  • Figure 3: Pipeline of user input processing modules defined in the config.yml file generated by the chatbot engine.
  • Figure 4: Intent prediction confidence distribution on the French dataset.
  • Figure 5: Intent prediction confidence distribution on the Wolof dataset.
  • ...and 1 more figures
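
The three steps named in the Figure 1 caption (parsing, translation, backfilling) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes annotations are written in a "[label : value]" bracket notation, that IDs take the hypothetical form __0__, __1__, and that the MT system leaves those IDs untouched. The function names parse, project and toy_translate are invented for this example, and toy_translate is a stand-in that merely upper-cases text so that no Wolof output is fabricated.

```python
import re
from typing import Callable, List, Tuple

# Assumed annotation notation: "[label : value]" (hypothetical for this sketch).
ANNOTATION = re.compile(r"\[(?P<label>[^:\]]+?)\s*:\s*(?P<value>[^\]]+?)\s*\]")
# Hypothetical ID format used to stand in for an annotated span.
PLACEHOLDER = re.compile(r"__\d+__")


def parse(annotated: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Step 1: replace each annotated span with an ID and remember (label, value)."""
    spans: List[Tuple[str, str]] = []

    def to_id(match: "re.Match[str]") -> str:
        spans.append((match.group("label").strip(), match.group("value").strip()))
        return f"__{len(spans) - 1}__"

    return ANNOTATION.sub(to_id, annotated), spans


def project(annotated: str, translate: Callable[[str], str]) -> str:
    """Run parsing, translation and backfilling on one annotated sentence."""
    parsed, spans = parse(annotated)
    translated = translate(parsed)  # Step 2: translate the ID-bearing sentence.
    for index, (label, value) in enumerate(spans):
        placeholder = f"__{index}__"
        if placeholder not in translated:
            # The translator dropped or altered an ID, so the label cannot be restored.
            raise ValueError(f"ID {placeholder} lost during translation")
        # Step 3: translate the span value on its own and put it back with its label.
        translated = translated.replace(placeholder, f"[{label} : {translate(value)}]")
    return translated


def toy_translate(text: str) -> str:
    """Stand-in for an MT system: upper-cases words and keeps IDs unchanged."""
    return " ".join(
        tok if PLACEHOLDER.fullmatch(tok) else tok.upper() for tok in text.split()
    )


if __name__ == "__main__":
    source = "mets une alarme pour [time : neuf heures] [date : vendredi]"
    print(project(source, toy_translate))
```

Raising an error when an ID goes missing is a design choice of this sketch: a sentence whose IDs do not survive translation cannot be relabeled reliably, so it seems safer to discard it than to guess where the label belongs.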