The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs

Bryan Guan, Tanya Roosta, Peyman Passban, Mehdi Rezagholizadeh

TL;DR

The paper investigates order sensitivity in large language models, meaning how outputs change when the contents of a prompt are reorganized, with a focus on closed-source models and models accessed only through an API. It runs five experiments across MRPC, MSMARCO, MMLU, MedMCQA, and WebGPT, evaluating zero-shot and few-shot prompts with original and shuffled input orders. Shuffling the input order significantly degrades performance across most tasks and models, few-shot prompting offers only limited mitigation, and longer inputs worsen the sensitivity. The work highlights practical risks in high-stakes applications and motivates more robust input-handling techniques and future model designs that reduce order dependence.

Abstract

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in LLMs whose internal components are hidden from users (such as closed-source models or those accessed via API calls). We conduct experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting shows mixed effectiveness: it offers partial mitigation but fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
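
The shuffled-order condition for a multiple-choice task can be illustrated with a minimal sketch. It is not the authors' code: the function names, the lettered prompt template, and the seeded shuffle are all assumptions made for illustration. The sketch permutes the answer options, tracks where the correct option lands, and scores predictions so that original-order and shuffled-order accuracy can be compared.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical option labels; the paper's actual prompt format is not given here.
LABELS = "ABCDEFGHIJ"


def shuffle_choices(
    choices: Sequence[str], answer_idx: int, seed: int
) -> Tuple[List[str], int]:
    """Return the choices in a seeded random order plus the new index of the answer."""
    rng = random.Random(seed)  # local RNG so runs are reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The correct option now sits wherever its old index ended up.
    new_answer_idx = order.index(answer_idx)
    return shuffled, new_answer_idx


def build_prompt(question: str, choices: Sequence[str]) -> str:
    """Format a question and its lettered options as a zero-shot prompt (assumed template)."""
    lines = [question]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

To use it, build one prompt from the original choices and one from `shuffle_choices(...)`, send both to the model under test, and compare `accuracy` across the two conditions. The gold label for the shuffled prompt is `LABELS[new_answer_idx]`, not the original letter.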

Paper Structure

This paper contains 13 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: How GPT-4o responds to the same question when the order of choices is reversed. The calls were made on Tuesday, May 6th, at 16:29 EST.