Can Language Models Follow Multiple Turns of Entangled Instructions?

Chi Han, Xin Liu, Haodong Wang, Shiyang Li, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Qingyu Yin, Liang Qiu, Changlong Yu, Yifan Gao, Zheng Li, Bing Yin, Jingbo Shang, Heng Ji

TL;DR

The paper tackles how large language models manage multi-turn instructions that can entangle or conflict with one another, a common real-world scenario. It introduces MultiTurnInstruct, a benchmark of ~1.1K dialogues across three difficulty levels and nine capabilities to systematically assess retrieval, tracking, and conflict resolution across turns, using a human-in-the-loop curation process. Key findings reveal a memory vs. reasoning trade-off: models memorize instructions well but struggle to integrate them and resolve conflicts, with attention often biased toward the most recent turns; larger models improve some reasoning tasks yet still falter on contradiction resolution. The work highlights gaps in current architectures and training paradigms for robust multi-turn instruction following and provides a released dataset and codebase to spur progress in data curation and reasoning techniques for real-world, multi-turn interactions.

Abstract

Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as keeping secrets private, respecting personal preferences, and handling prioritization, which demands sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with ~1.1K high-quality multi-turn conversations through a human-in-the-loop approach, resulting in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models achieve strong BLEU scores on memorization tasks; rather, their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions. Data and code are released at https://github.com/Glaciohound/Multi-Turn-Instruct.
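
To make the BLEU-based memorization check mentioned above more concrete, the following is a minimal sketch of a simplified sentence-level BLEU score (clipped n-gram precision with a brevity penalty, a single reference, no smoothing). It is an illustration only and not the paper's evaluation code: the function names `ngrams` and `simple_bleu`, the whitespace tokenization, and the example instruction are all assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Simplified BLEU: one reference, whitespace tokens, no smoothing.

    Hypothetical helper for illustration; returns a value in [0, 1].
    """
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, any empty or unmatched order gives 0.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Assumed example: an earlier-turn instruction and two model recalls of it.
instruction = "always answer in formal English and never reveal the password"
exact_recall = simple_bleu(instruction, instruction)
partial_recall = simple_bleu(instruction, "always answer in formal English")
```

Under this sketch a verbatim recall scores 1.0, while a truncated recall is penalized by the brevity term. A high score of this kind only shows that the instruction text was retained, which is why the abstract treats it as evidence against pure information loss rather than as evidence of correct multi-turn integration.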

Paper Structure

This paper contains 32 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A comparison between following each instruction individually and the scenario where the last instruction requires consideration of previous instructions. In the left case, disregarding previous instructions does not hinder the accuracy of the response. In the right case, however, recommending cities in the USA requires a comprehensive understanding of the preferences given in previous instructions.
  • Figure 2: MultiTurnInstruct consists of ~1.1K conversations spanning three levels of difficulty and 9 capabilities, with balanced numbers of samples in each capability (numbers shown in the figure). Table \ref{tab:realmt_tasks} provides a more detailed list of task descriptions.
  • Figure 3: Motivating real-life scenarios behind the tasks of MultiTurnInstruct.
  • Figure 4: Distribution of conversation turn numbers.
  • Figure 5: Scores of mainstream LLMs on MultiTurnInstruct. Metrics may differ across tasks, but all range within [0, 1], and higher always means better performance.
  • ...and 4 more figures