Serial Position Effects of Large Language Models

Xiaobo Guo; Soroush Vosoughi

TL;DR

This work investigates serial position effects (SPE) in large language models across encoder–decoder and decoder-only architectures, spanning classification and summarization tasks. It uses label shuffling and input reordering to expose SPE, quantifies them with the Jensen–Shannon divergence between predicted distributions and with differences in BERTScore, and evaluates mitigation via prompting and Chain-of-Thought (CoT). The results show that SPE are widespread and depend on both task and model, that primacy dominates in many cases, and that mitigation through prompts or CoT is inconsistent across models and tasks. The findings highlight the practical importance of SPE in real-world inference without ground-truth labels and motivate further research into robust, architecture-aware mitigation strategies for safer LLM deployment.
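The label-shuffling measurement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of a uniform reference distribution, and the stand-in "model" that always picks the first listed label are all assumptions made for illustration. The idea is to present the same label set in many random orders, record the position of whichever label is chosen, and compare the resulting position distribution with a position-agnostic one using the Jensen–Shannon divergence.

```python
import math
import random
from collections import Counter


def shuffled_label_orders(labels, n_trials, seed=0):
    """Return n_trials random permutations of the label list (hypothetical helper)."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_trials):
        order = list(labels)
        rng.shuffle(order)
        orders.append(order)
    return orders


def position_distribution(orders, picks):
    """Fraction of trials in which the picked label sat at each list position."""
    n_positions = len(orders[0])
    counts = Counter(order.index(pick) for order, pick in zip(orders, picks))
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(n_positions)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    labels = ["transfer", "card_lost", "balance", "exchange_rate"]  # illustrative labels
    orders = shuffled_label_orders(labels, n_trials=200)

    # Assumed stand-in for an LLM with an extreme primacy bias:
    # it always answers with whichever label is listed first.
    picks = [order[0] for order in orders]

    dist = position_distribution(orders, picks)
    uniform = [1 / len(labels)] * len(labels)
    print("position distribution:", dist)
    print("JS divergence from uniform:", js_divergence(dist, uniform))
```

In a real experiment the `picks` list would hold the label each model returns for each shuffled prompt. A divergence near zero indicates that position has little influence on the choice, while larger values indicate a stronger position dependence. The uniform reference is only meaningful as a baseline when the correct label is itself equally likely to land at any position.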

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in zero-shot learning applications, generating responses to queries using only pre-training information without the need for additional fine-tuning. This represents a significant departure from traditional machine learning approaches. Previous research has indicated that LLMs may exhibit serial position effects, such as primacy and recency biases, which are well-documented cognitive biases in human psychology. Our extensive testing across various tasks and models confirms the widespread occurrence of these effects, although their intensity varies. We also discovered that while carefully designed prompts can somewhat mitigate these biases, their effectiveness is inconsistent. These findings underscore the significance of serial position effects during the inference process, particularly in scenarios where there are no ground truth labels, highlighting the need for greater focus on addressing these effects in LLM applications.

Paper Structure

This paper contains 31 sections, 21 figures, and 7 tables.

Introduction
Related Work
Models and Datasets
Datasets
Classification Datasets
Summarization Datasets
Experiment Settings
Label Shuffling Experiments
Summarization Experiments
Influence of the Serial Position Effects
Potential Methods for Mitigating Serial Position Effects
Prompting Experiments
Experiments with Prompts
Influence of the Prompt Design
Chain-of-Thought Experiments
...and 16 more sections

Figures (21)

Figure 1: SPE of SOLAR-0-70b-16bit: This model tends to select labels positioned at the beginning and end of a sequence more frequently. The plot illustrates the distribution of label selections across 42 labels, with the x-axis representing label positions and the y-axis the probability of selection. The red line shows the cumulative probability distribution.
Figure 2: Examples from the Banking77 dataset where the input remains the same, but the labels are shuffled.
Figure 3: Distributions and cumulative distributions of predicted labels for each task across all models, with the type of SPE indicated at the top of each figure and the SPEM noted in brackets. The x-axes represent the position of the labels or articles for summarization tasks. The y-axes indicate the difference in BERTScores for summarization tasks and the probability of label selection for other tasks. Red lines illustrate the cumulative probabilities.
Figure 4: Illustration of the various prompts used to direct model attention to specific parts of the input. "N" represents the number of labels constituting one-third of the total list.
Figure 5: t-SNE visualization of label distribution for the TACRED dataset, displayed across different model and prompt combinations. Each color represents a distinct model, and various markers are used to denote different prompts.
...and 16 more figures
