Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking dialogs

Arian Askari; Roxana Petcu; Chuan Meng; Mohammad Aliannejadi; Amin Abolghasemi; Evangelos Kanoulas; Suzan Verberne

TL;DR

The paper tackles the scarcity of labeled intents in information-seeking dialogs by introducing SOLID, a zero-shot dialog generation framework with self-seeding and multi-intent self-instructing, and SOLID-RL, an efficiency-optimized variant trained with DPO and guided by length-based quality estimation. It constructs two large synthetic datasets, SOLISpeak and SOLITurbo, containing hundreds of thousands of intent-aware dialogs, and demonstrates that intent prediction (IP) models trained on these data outperform those trained on human data alone or with few-shot LLM baselines. Key findings include the superiority of self-seeding over external seeds, the effectiveness of multi-intent self-instruction, and substantial efficiency gains from SOLID-RL (approximately 12x faster). The work provides practical data-generation pipelines and benchmarks for augmenting IP training data, with implications for scalable dialog systems and future multi-task extensions across languages.

Abstract

Identifying user intents in information-seeking dialogs is crucial for a system to meet users' information needs. Intent prediction (IP) is challenging and demands sufficient dialogs with human-labeled intents for training. However, manually annotating intents is resource-intensive. While large language models (LLMs) have been shown to be effective in generating synthetic data, there is no study on using LLMs to generate intent-aware information-seeking dialogs. In this paper, we focus on leveraging LLMs for zero-shot generation of large-scale, open-domain, and intent-aware information-seeking dialogs. We propose SOLID, which has novel self-seeding and multi-intent self-instructing schemes. The former improves the generation quality by using the LLM's own knowledge scope to initiate dialog generation; the latter prompts the LLM to generate utterances sequentially, and mitigates the need for manual prompt design by asking the LLM to autonomously adapt its prompt instruction when generating complex multi-intent utterances. Furthermore, we propose SOLID-RL, which is further trained to generate a dialog in one step on the data generated by SOLID. We propose a length-based quality estimation mechanism to assign varying weights to SOLID-generated dialogs based on their quality during the training process of SOLID-RL. We use SOLID and SOLID-RL to generate more than 300k intent-aware dialogs, surpassing the size of existing datasets. Experiments show that IP methods trained on dialogs generated by SOLID and SOLID-RL achieve better IP quality than ones trained on human-generated dialogs.
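The generation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `LLM` callable, the prompt wording, the intent names in `INTENT_DEFINITIONS`, and the strict user/agent alternation are all assumptions made for illustration. It only shows the control flow: one self-seeding call, then one utterance per step, with an extra LLM call that merges instructions when a step carries several intents.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]

# Hypothetical intent definitions; the paper's own taxonomy is not reproduced here.
INTENT_DEFINITIONS: Dict[str, str] = {
    "ask_question": "Ask an initial question about the topic.",
    "give_answer": "Provide a possible answer to the question.",
    "ask_clarification": "Ask the other party to clarify something.",
    "give_feedback": "React to the previous answer.",
}


def build_instruction(intents: Sequence[str], llm: LLM) -> str:
    """Return one instruction for an utterance with one or more intents."""
    definitions = [INTENT_DEFINITIONS[name] for name in intents]
    if len(definitions) == 1:
        # Single intent: use its definition directly.
        return definitions[0]
    # Multiple intents: ask the LLM itself to merge them (self-instruction).
    merge_prompt = "Merge these instructions into one cohesive instruction:\n"
    merge_prompt += "\n".join(f"- {d}" for d in definitions)
    return llm(merge_prompt)


def generate_dialog(llm: LLM, intent_sequence: Sequence[Sequence[str]]) -> List[dict]:
    """Generate a dialog turn by turn, conditioned on a self-generated seed."""
    # Self-seeding: the LLM proposes the topic from its own knowledge.
    seed = llm("Name an entity you know well and give a short background on it.")
    turns: List[dict] = []
    for index, intents in enumerate(intent_sequence):
        # Assumption: speakers alternate, starting with the user.
        speaker = "user" if index % 2 == 0 else "agent"
        instruction = build_instruction(intents, llm)
        history = "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)
        prompt = (
            f"Seed: {seed}\nDialog so far:\n{history}\n"
            f"As the {speaker}, write the next utterance. Instruction: {instruction}"
        )
        turns.append({"speaker": speaker, "intents": list(intents), "text": llm(prompt)})
    return turns
```

Because every utterance costs at least one LLM call, and multi-intent utterances cost two, this sequential scheme is slow for long dialogs, which is the motivation the abstract gives for training SOLID-RL to generate a dialog in one step.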

Paper Structure (29 sections, 16 figures, 12 tables)

Introduction
Related work
Intent-Aware Dialog Generation
SOLID
SOLID-RL
Dataset Creation
Experimental Setup
Results
Discussion
Conclusion
Appendix
Intents
Manually-crafted instructions for User and Agent Intents
Details of intent predictors
Impact of hallucination
...and 14 more sections

Figures (16)

Figure 1: Example dialog with sequence of intents.
Figure 2: Example of a seed generated through self-seeding via SOLID.
Figure 3: Illustrating SOLID and SOLID-RL: Starting with a self-generated seed and a sequence of real-world intents, SOLID produces utterances sequentially, with each utterance conditioned on one or more intents. During phase 3, each utterance falls into one of two categories: it either corresponds to a single intent, based on a definition from the intent taxonomy table, or combines multiple intent definitions. In the case of multiple intents, SOLID employs multi-intent self-instruction to merge these into a single cohesive instruction.
Figure 4: The illustration of seed generation through self-seeding via SOLID.
Figure 5: The illustration of seed generation through self-seeding via SOLID.
...and 11 more figures
