Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation

Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre

TL;DR

To make the generated dialogues more open and closer to real-life scenarios, two notions are added: speech events, which correspond to the type of conversation the speakers are involved in, and common ground, which represents the premises of a conversation.

Abstract

Research on Open-Domain Dialogue agents focuses predominantly on English, in both models and datasets. Furthermore, crowdsourcing such datasets for fine-tuning is costly and time-consuming, particularly when multiple languages are involved. Fortunately, advances in Large Language Models (LLMs) have opened up many possibilities across diverse tasks. Specifically, instruction-tuning has enabled LLMs to execute tasks based on natural language instructions, occasionally surpassing the performance of human crowdworkers. Additionally, these models can operate in several languages within a single thread. Consequently, to generate new samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using LLMs, with demonstrations provided in a single Source Language. By eschewing explicit Machine Translation in this approach, we enhance the adherence to language-specific nuances. We apply this methodology to the PersonaChat dataset. To enhance the openness of generated dialogues and mimic real-life scenarios, we added the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.

Paper Structure

This paper contains 94 sections, 6 equations, 35 figures, 33 tables.

Figures (35)

  • Figure 1: MOUD Generation Pipeline: (0) Taxonomies are manually expanded by interacting with an LLM. (1) Non-translated $l_S$ examples are introduced into the prompt to generate new $l_T$ samples. (2) Common ground is created based on two generated personas and a sampled speech event. (3) The outputs from steps (1) and (2) are integrated into prompts for interactions between two LLM instances. Nucleus sampling is used at every step for diversity. Examples in this figure highlight language-specific elements for French, Spanish and Swahili. For more detailed examples see Table \ref{tab:example-english-1}, Table \ref{tab:example-english-2}, Table \ref{tab:example-french} in Appendix \ref{appendix:examples-from-MOUD}. (4) Generated data from steps (1), (2) and (3) are evaluated by humans and by an LLM acting as a judge on selected criteria, as explained in Section \ref{sec:data-qualitative-evaluation}.
  • Figure 2: Sunburst chart of the entity with the most root verbs and associated direct object nouns for French generated personas with LLaMA3.1-8B.
  • Figure 3: Demographic Form Completed by Users at their First Login on the Evaluation Platform
  • Figure 4: Additional Guidelines Before Each Conversation's Evaluation on the Platform
  • Figure 5: Persona's Human Evaluation Form
  • ...and 30 more figures
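
The Figure 1 caption states that nucleus sampling is used at every step for diversity. As an illustration only, the following is a minimal sketch of generic nucleus (top-p) sampling over a toy next-token distribution. It is not the authors' implementation: the function names `nucleus_filter` and `nucleus_sample`, the example probabilities, and the `top_p` value are all assumptions introduced here.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, and renormalize their probabilities.

    Hypothetical helper for illustration; `probs` is a list of probabilities
    indexed by token id.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs, top_p, rng=None):
    """Draw one token id from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy distribution over four token ids (assumed values, not from the paper).
toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(toy_probs, top_p=0.75)
token = nucleus_sample(toy_probs, top_p=0.75, rng=random.Random(0))
```

With these toy values only the two most probable tokens remain in the nucleus, so low-probability tokens are never drawn while the sample still varies between runs.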