Exploiting Primacy Effect To Improve Large Language Models

Bianca Raimondi, Maurizio Gabbrielli

TL;DR

The paper addresses positional biases in large language models, in particular the primacy effect observed in Multiple Choice Question Answering (MCQA). It shows that fine-tuning amplifies this bias relative to pre-trained models, and it introduces a training-free method that reorders answer options by semantic similarity to the query, so that primacy works in favor of accuracy. The approach uses a simple metric, $\mathrm{Sim}(O,Q)$, based on token-wise cosine similarity computed with a frozen encoder, which makes it inexpensive to apply across datasets (CLINC, BANKING, HWU) and architectures (Llama2, Llama3, Mistral). Empirical results show consistent performance gains, with a recency bias also playing a role in certain models, suggesting that bias-aware strategies can improve results wherever positional bias is present. Overall, the work reframes bias as a lever for performance and offers practical, model-agnostic techniques for bias-aware NLP design.

Abstract

Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of their answers. The primacy effect, where items presented first are more likely to be remembered or selected, plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: we first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. We then strategically leverage this effect by reordering response options based on their semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
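The reordering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `embed` function below is a toy bag-of-words stand-in for the frozen encoder, the plain cosine similarity stands in for the paper's $\mathrm{Sim}(O,Q)$ metric, and the function names and the distractor labels ("Card arrival", "Exchange rate") are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a frozen encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)


def reorder_options(query: str, options: list[str]) -> list[str]:
    """Return the options sorted by descending similarity to the query.

    No knowledge of the correct label is used; only the query text and
    the option texts enter the ranking.
    """
    q = embed(query)
    # sorted() is stable, so options with equal similarity keep their order.
    return sorted(options, key=lambda o: cosine(q, embed(o)), reverse=True)


if __name__ == "__main__":
    options = ["Card arrival", "Change PIN", "Exchange rate"]
    print(reorder_options("I need a new PIN", options))
```

In practice the toy `embed` would be replaced by a real sentence encoder; the ranking logic stays the same, and the reordered list is what gets written into the prompt so that the most query-like option sits in the position the model favors.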

Paper Structure

This paper contains 11 sections, 1 equation, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Given a Query (in this case, "I need a new PIN") and a set of 77 Options, the model must select the unique correct label (in this case, "Change PIN"). Due to the Primacy effect, the model tends to answer correctly only when the correct label is placed in the first positions.
  • Figure 2: An example of the evaluation process for Primacy bias detection. Each query is fixed, and the target label is systematically shuffled across all possible positions. For each position, the model's prediction is recorded, and its correctness is determined by comparing it to the target label. Most of the time, when the target label is placed in the first positions, the model predicts the correct answer, confirming the presence of the Primacy bias.
  • Figure 3: An example of the ranking process of options based on their similarity to the query. The cosine similarity is computed between query and option embeddings. The options are then ranked in descending order of similarity, with the most similar option placed first.
  • Figure 4: Comparison of Primacy bias between pre-trained and fine-tuned versions of models for the CLINC dataset. The x-axis represents the position of the target label, and the y-axis shows accuracy over all the samples. The Cumulative distribution, represented by the red line in the plots, shows the proportion of total accuracy accumulated as the label position increases. Fine-tuned models demonstrate a stronger Primacy bias, with higher accuracy for labels in early positions.
  • Figure 5: Comparison of Primacy bias in Llama3-8B-Instruct across three datasets with varying numbers of labels. The bias is less pronounced in the HWU dataset (54 labels) and more pronounced in the CLINC dataset (150 labels), demonstrating that the Primacy effect intensifies as the number of labels in the prompt increases.
  • ...and 2 more figures
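
The evaluation procedure in Figure 2 and the cumulative curve in Figure 4 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: `predict` is a hypothetical callable wrapping a model, the distractors are kept in a fixed order while the target moves through every slot, and `first_two_only` is an artificial stand-in model with an extreme primacy bias.

```python
from typing import Callable, Sequence


def position_sweep(
    predict: Callable[[str, Sequence[str]], str],
    query: str,
    target: str,
    distractors: Sequence[str],
) -> list[bool]:
    """Place the target at each position in turn and record correctness."""
    hits = []
    for pos in range(len(distractors) + 1):
        options = list(distractors[:pos]) + [target] + list(distractors[pos:])
        hits.append(predict(query, options) == target)
    return hits


def cumulative_share(acc_by_pos: Sequence[float]) -> list[float]:
    """Proportion of total accuracy accumulated up to each position."""
    total = sum(acc_by_pos)
    if total == 0:
        return [0.0] * len(acc_by_pos)
    running, shares = 0.0, []
    for acc in acc_by_pos:
        running += acc
        shares.append(running / total)
    return shares


def first_two_only(query: str, options: Sequence[str]) -> str:
    """Hypothetical model: only 'reads' the first two options."""
    return "Change PIN" if "Change PIN" in options[:2] else options[0]


if __name__ == "__main__":
    hits = position_sweep(
        first_two_only,
        "I need a new PIN",
        "Change PIN",
        ["Card arrival", "Exchange rate", "Top up"],
    )
    print(hits)
    print(cumulative_share([float(h) for h in hits]))
```

With many queries, averaging the per-position hits gives the accuracy-by-position bars, and a cumulative curve that saturates early indicates that most correct answers come from early target positions.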