Table of Contents
Fetching ...

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

Yuchong Sun, Che Liu, Kun Zhou, Jinwen Huang, Ruihua Song, Wayne Xin Zhao, Fuzheng Zhang, Di Zhang, Kun Gai

TL;DR

Parrot tackles the underexplored area of multi-turn instruction following in LLMs by (a) automatically collecting human-like multi-turn instructions through Parrot-Ask, (b) introducing Context-aware Preference Optimization (CaPO) to train models to better leverage context, and (c) establishing MT-Bench++ to evaluate long-turn capabilities. The authors construct Parrot-40K, a long-turn, context-rich dataset including 30K negative examples, enhancing supervision beyond prior datasets. Empirical results show that Parrot-Chat with CaPO achieves state-of-the-art performance among 13B open-source models on MT-Bench and MT-Bench++, with notable improvements on later turns. The work provides open-source data and methods, enabling broader study and development of robust multi-turn instruction-following LLMs, while acknowledging limitations in benchmark size and data sources and outlining safety considerations.

Abstract

Humans often interact with large language models (LLMs) in multi-turn interaction to obtain desired answers or more information. However, most existing studies overlook the multi-turn instruction following ability of LLMs, in terms of training dataset, training method, and evaluation benchmark. In this paper, we introduce Parrot, a solution aiming to enhance multi-turn instruction following for LLMs. First, we introduce an efficient but effective method for collecting multi-turn instructions that feature human-like queries, such as anaphora and ellipsis. Second, we propose a context-aware preference optimization strategy to further enhance LLMs for complex queries in multi-turn interaction. Moreover, to quantitatively evaluate LLMs in multi-turn instruction following, we manually build a multi-turn benchmark derived from existing ones. Extensive experiments show that Parrot improves current LLMs by up to 7.2% in multi-turn instruction following. Our dataset and codes will be open-sourced to facilitate future research.

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

TL;DR

Parrot tackles the underexplored area of multi-turn instruction following in LLMs by (a) automatically collecting human-like multi-turn instructions through Parrot-Ask, (b) introducing Context-aware Preference Optimization (CaPO) to train models to better leverage context, and (c) establishing MT-Bench++ to evaluate long-turn capabilities. The authors construct Parrot-40K, a long-turn, context-rich dataset including 30K negative examples, enhancing supervision beyond prior datasets. Empirical results show that Parrot-Chat with CaPO achieves state-of-the-art performance among 13B open-source models on MT-Bench and MT-Bench++, with notable improvements on later turns. The work provides open-source data and methods, enabling broader study and development of robust multi-turn instruction-following LLMs, while acknowledging limitations in benchmark size and data sources and outlining safety considerations.

Abstract

Humans often interact with large language models (LLMs) in multi-turn interaction to obtain desired answers or more information. However, most existing studies overlook the multi-turn instruction following ability of LLMs, in terms of training dataset, training method, and evaluation benchmark. In this paper, we introduce Parrot, a solution aiming to enhance multi-turn instruction following for LLMs. First, we introduce an efficient but effective method for collecting multi-turn instructions that feature human-like queries, such as anaphora and ellipsis. Second, we propose a context-aware preference optimization strategy to further enhance LLMs for complex queries in multi-turn interaction. Moreover, to quantitatively evaluate LLMs in multi-turn instruction following, we manually build a multi-turn benchmark derived from existing ones. Extensive experiments show that Parrot improves current LLMs by up to 7.2% in multi-turn instruction following. Our dataset and codes will be open-sourced to facilitate future research.
Paper Structure (35 sections, 3 equations, 8 figures, 10 tables)

This paper contains 35 sections, 3 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: In multi-turn interactions, user queries often require LLMs to effectively utilize contextual information, e.g., anaphora and ellipsis. Directly using ChatGPT to simulate users can not fully mimic the above real-world occasions, while our Parrot-Ask trained on real-world conversations can better human-like queries.
  • Figure 2: The overall framework of Parrot. (a) First, we train the Parrot-Ask model on real user-ChatGPT logs to learn how real users pose queries, and utilize it to iteratively interact with ChatGPT to collect multi-turn instruction-response pairs. (b) Then we construct negative responses for queries that rely heavily on context for answering with three strategies to simulate three types of error cases. Finally, we use the collected data to train the Parrot-Chat model by (c) instruction tuning and (d) context-aware preference optimization.
  • Figure 3: The annotation guidelines given to annotators.
  • Figure 4: MT-Bench++ evaluation prompts for GPT-4.
  • Figure 5: Examples of Parrot-Ask generated queries and comparison with ChatGPT generated ones.
  • ...and 3 more figures