Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models

Hyeonseok Moon, Jaehyung Seo, Seungyoon Lee, Chanjun Park, Heuiseok Lim

TL;DR

IoInst is a structured benchmark that probes whether LLMs grasp the intended instruction when it is surrounded by instruction-formatted distractors. Each item combines a context with four candidate instructions and a meta-instruction, and the distractors are drawn under Random, Semantic, and Anti-Attribute settings. The study finds that most models, including instruction-tuned ones, struggle to identify the intended instruction. The results highlight the pivotal role of meta-instruction design and show that in-context few-shot prompts can harm rather than help instruction understanding. The work points to data-centric and modeling strategies for strengthening instruction comprehension and reducing distraction by extraneous instructions, with implications for the evaluation and deployment of LLMs in instruction-rich tasks.

Abstract

One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields and serves as a crucial metric for evaluating their performance. While numerous evaluation benchmarks have been developed, most focus solely on clear and coherent instructions. However, we have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills. To address this issue, we introduce the Intention of Instruction (IoInst) benchmark. This benchmark evaluates LLMs' capacity to remain focused and understand instructions without being misled by extraneous instructions. The primary objective of this benchmark is to identify the appropriate instruction that accurately guides the generation of a given context. Our findings suggest that even recently introduced state-of-the-art models still lack instruction understanding capability. In addition to proposing IoInst, we present broad analyses of several strategies potentially applicable to IoInst.
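
The task setup described above (a context, four candidate instructions, and a meta-instruction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `IoInstItem`, `build_item`, `render_prompt`, and `accuracy`, the prompt layout, and the sample meta-instruction wording are all assumptions made for illustration, and the benchmark's actual format may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class IoInstItem:
    """Hypothetical container for one benchmark item (not the paper's schema)."""
    context: str           # the response whose originating instruction must be found
    candidates: list[str]  # four candidate instructions, in presentation order
    label_index: int       # position of the true instruction within `candidates`


def build_item(context: str, gold: str, distractors: list[str],
               rng: random.Random) -> IoInstItem:
    """Mix the true instruction with three distractors and shuffle them."""
    if len(distractors) != 3:
        raise ValueError("expected exactly three distractor instructions")
    if gold in distractors:
        raise ValueError("distractors must differ from the true instruction")
    candidates = [gold, *distractors]
    rng.shuffle(candidates)
    return IoInstItem(context, candidates, candidates.index(gold))


def render_prompt(item: IoInstItem, meta_instruction: str) -> str:
    """Assumed layout: meta-instruction, then context, then numbered candidates."""
    lines = [meta_instruction, "", "Context:", item.context, "", "Candidates:"]
    lines += [f"{i + 1}. {c}" for i, c in enumerate(item.candidates)]
    return "\n".join(lines)


def accuracy(items: list[IoInstItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted 0-based index matches the label."""
    correct = sum(pred == item.label_index
                  for item, pred in zip(items, predictions))
    return correct / len(items)
```

A caller would render each item with `render_prompt`, send the text to a model, map the model's choice to a 0-based index, and pass the indices to `accuracy`. How the three distractors are chosen (Random, Semantic, or Anti-Attribute) is left outside this sketch.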

Paper Structure (35 sections, 1 equation, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Simplified example of IoInst. We compose a benchmark that requires a model to comprehend and select the appropriate instruction from which the given response is derived. Potential error cases include misunderstanding the prerequisites of the context and responding to one of the candidate instructions.
  • Figure 2: Construction of pool-based contrastive instructions. From the pre-processed data point obtained through data curation, we establish our instruction candidates.
  • Figure 3: Construction of Anti-Attribute contrastive instructions. From the pre-processed data point obtained through data curation, we establish our instruction candidates.
  • Figure 4: Performance comparison between zero-shot and few-shot settings.
  • Figure 5: Performance variations with diverse temperature settings. Temperature 0.0 indicates greedy decoding.
  • ...and 2 more figures