KCIF: Knowledge-Conditioned Instruction Following

Rudra Murthy, Praveen Venkateswaran, Prince Kumar, Danish Contractor

TL;DR

This work investigates how knowledge tasks interact with instruction following in large language models by introducing a knowledge-conditioned instruction-following benchmark built on multiple-choice tasks. It systematically evaluates how simple answer-modifying instructions, as well as distractor instructions, affect performance across model families and sizes, revealing substantial performance drops even for frontier models. The authors provide an end-to-end evaluation framework with automated error classification, plus Full and Lite benchmark datasets, enabling scalable, model-agnostic analysis and extension. Key findings show that larger models are more robust yet still suffer notable declines, underscoring the need to study knowledge/reasoning and instruction-following capabilities jointly. The framework and dataset releases aim to drive future work toward more reliable, instruction-compliant LLM systems.

Abstract

LLM evaluation benchmarks have traditionally separated the testing of knowledge/reasoning capabilities from instruction following. In this work, we study the interaction between knowledge and instruction following, and observe that LLMs struggle to follow simple answer-modifying instructions and are also distracted by instructions that should have no bearing on the original knowledge task answer. We leverage existing multiple-choice knowledge benchmarks and apply a set of simple instructions that include manipulating text (e.g., changing case), modifying numeric quantities (e.g., increasing a value, changing formatting), operating on lists (e.g., sorting answer candidates), and distractor instructions (e.g., changing the case of numeric answers). We evaluate models of varying parameter sizes (1B-405B) from different model families and find that, surprisingly, all models show a significant drop in performance on such simple task compositions. While large and frontier models show performance drops of 40-50%, the drop in small and medium-sized models is severe (sometimes exceeding 80%). Our results highlight a limitation in the traditional separation of knowledge/reasoning and instruction following, and suggest that the joint study of these capabilities is important. We release our benchmark dataset, evaluation framework code, and results for future work.
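The instruction types named in the abstract can be illustrated with a minimal sketch. It is not the released evaluation framework: the function names, the `INSTRUCTIONS` table, and the whitespace-insensitive `exact_match` rule are assumptions made for illustration, and only `print_correct_answer` is a name that appears in the paper's figure captions. Each instruction maps the correct answer of a multiple-choice item to the output the model is expected to produce, which is then compared against the model's response.

```python
from typing import Callable, Dict, List

# Hypothetical instruction implementations (illustrative stand-ins only).
# Each takes the correct answer text and the list of answer candidates and
# returns the string a fully compliant model would be expected to print.

def print_correct_answer(answer: str, options: List[str]) -> str:
    # Baseline: print the text of the correct option unchanged.
    return answer

def uppercase_answer(answer: str, options: List[str]) -> str:
    # Text manipulation: change the case of the answer.
    # On a purely numeric answer this is a no-op, which is what makes the
    # instruction a distractor for such items.
    return answer.upper()

def increment_numeric_answer(answer: str, options: List[str]) -> str:
    # Numeric manipulation: increase an integer answer by one.
    return str(int(answer) + 1)

def sort_options(answer: str, options: List[str]) -> str:
    # List operation: print all answer candidates in sorted order.
    return ", ".join(sorted(options))

INSTRUCTIONS: Dict[str, Callable[[str, List[str]], str]] = {
    "print_correct_answer": print_correct_answer,
    "uppercase_answer": uppercase_answer,
    "increment_numeric_answer": increment_numeric_answer,
    "sort_options": sort_options,
}

def expected_output(instruction: str, answer: str, options: List[str]) -> str:
    # Look up the instruction and compute the expected model output.
    return INSTRUCTIONS[instruction](answer, options)

def exact_match(prediction: str, expected: str) -> bool:
    # Assumed scoring rule: equality after trimming surrounding whitespace.
    return prediction.strip() == expected.strip()
```

Under these assumptions, a model that answers "Paris" to an item paired with the uppercase instruction has the right knowledge but fails exact match, since the expected output is "PARIS". This is the kind of composition failure the abstract describes.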

Paper Structure

This paper contains 38 sections, 42 figures, 9 tables.

Figures (42)

  • Figure 1: Average exact match performance across all tasks for the print_correct_answer (PCA) and print_correct_answer_label (PCA Label) instructions.
  • Figure 2: Knowledge and instruction following (IF) errors across all tasks for the print_correct_answer instruction. A lower error is better. Results shown using Full Benchmark data. Lite Benchmark results can be found in Appendix Figure \ref{fig:pcavspcalabelLite}.
  • Figure 3: Impact of distractor instructions on exact match performance across tasks and instructions, compared to the corresponding print_correct_answer performance. A drop indicates that the model was distracted by an inapplicable instruction. Results reported on Lite Benchmark.
  • Figure 4: Classification of errors for the Llama and Qwen family of models.
  • Figure 5: Lite Benchmark: Performance of LLMs on the task of printing the correct answer, with error comparison. PCA refers to the print_correct_answer instruction and PCA Label refers to the print_correct_answer_label instruction.
  • ...and 37 more figures