Serial Position Effects of Large Language Models
Xiaobo Guo, Soroush Vosoughi
TL;DR
This work investigates serial position effects (SPE) in large language models across encoder–decoder and decoder-only architectures, on classification and summarization tasks. It quantifies SPE by shuffling label order and reordering inputs, using metrics such as Jensen–Shannon divergence on predicted distributions and differences in BERTScore, and it evaluates mitigation via prompting and Chain-of-Thought (CoT). The results show that SPE are widespread and task- and model-dependent, with primacy dominating in many cases, and that mitigation through prompts or CoT is inconsistent across models and tasks. The findings highlight the practical importance of SPE in real-world, unlabeled inference and motivate further research into robust, architecture-aware mitigation strategies for safer LLM deployment.
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in zero-shot learning applications, generating responses to queries using only pre-training information without the need for additional fine-tuning. This represents a significant departure from traditional machine learning approaches. Previous research has indicated that LLMs may exhibit serial position effects, such as primacy and recency biases, which are well-documented cognitive biases in human psychology. Our extensive testing across various tasks and models confirms the widespread occurrence of these effects, although their intensity varies. We also discovered that while carefully designed prompts can somewhat mitigate these biases, their effectiveness is inconsistent. These findings underscore the significance of serial position effects during the inference process, particularly in scenarios where there are no ground truth labels, highlighting the need for greater focus on addressing these effects in LLM applications.
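The label-shuffling measurement mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical `predict(text, options)` callable standing in for an LLM query, and it assumes that bias is scored as the Jensen–Shannon divergence between the distribution of chosen option positions and a uniform distribution. The paper's exact comparison may differ.

```python
import math
import random
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; the value lies in [0, 1]."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def position_distribution(predict, text, labels, n_shuffles=200, seed=0):
    """Shuffle the label order repeatedly and record which position is chosen."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_shuffles):
        order = labels[:]
        rng.shuffle(order)
        chosen = predict(text, order)
        counts[order.index(chosen)] += 1
    return [counts[i] / n_shuffles for i in range(len(labels))]


# Hypothetical stand-ins for an LLM (assumptions, not real model calls).
def primacy_biased_model(text, options):
    return options[0]  # always picks whatever is listed first


def content_model(text, options):
    return "positive"  # picks by content, regardless of position


labels = ["positive", "negative", "neutral"]
uniform = [1 / len(labels)] * len(labels)

biased = position_distribution(primacy_biased_model, "great movie", labels)
unbiased = position_distribution(content_model, "great movie", labels)

print("primacy-biased JSD:", js_divergence(biased, uniform))
print("content-driven JSD:", js_divergence(unbiased, uniform))
```

A model that answers from content alone should land on each position about equally often once the labels are shuffled, so its divergence from uniform stays near zero. A model with a primacy bias concentrates its choices on the first position and scores noticeably higher.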
