Investigating Context Effects in Similarity Judgements in Large Language Models

Sagar Uprety, Amit Kumar Jaiswal, Haiming Liu, Dawei Song

TL;DR

The paper investigates whether large language models (LLMs) exhibit context-driven order effects in similarity judgements, replicating Tversky and Gati's human study across eight LLMs. Using single- and dual-prompt designs with multiple temperatures and prompt variants, it tests whether LLMs reproduce the human-like asymmetry on a $0$–$20$ similarity scale and whether presenting both orders eliminates the effect. Only a subset of models (notably GPT-4 and certain Llama3 configurations) align with human judgements, and only under specific conditions; many models show no significant order effect, and temperature and prompt style can modulate the bias. These results inform the design and deployment of LLM-based agents: prompts and hyperparameters need careful tuning to achieve or avoid human-like biases, depending on the application.
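
As a rough illustration of the single-prompt design, the order effect for one pair can be read as the difference between the similarity score given to "A to B" and the score given to "B to A". The sketch below is a minimal example under stated assumptions: the prompt wording, the function names, and the placeholder item names are hypothetical and are not taken from the paper; only the 0 to 20 scale comes from the summary above.

```python
from statistics import mean

# Rating scale mentioned in the summary above.
SCALE_MIN, SCALE_MAX = 0, 20


def single_prompts(a: str, b: str) -> tuple[str, str]:
    """Build the two single-order prompts for one pair.

    The wording is a hypothetical stand-in, not the paper's prompt.
    """
    template = (
        "On a scale of {lo} to {hi}, how similar is {x} to {y}? "
        "Answer with a single number."
    )
    forward = template.format(lo=SCALE_MIN, hi=SCALE_MAX, x=a, y=b)
    reverse = template.format(lo=SCALE_MIN, hi=SCALE_MAX, x=b, y=a)
    return forward, reverse


def asymmetry(score_ab: float, score_ba: float) -> float:
    """Order effect for one pair: sim(A, B) minus sim(B, A)."""
    for score in (score_ab, score_ba):
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score {score} is outside the rating scale")
    return score_ab - score_ba


def mean_asymmetry(pairs: list[tuple[float, float]]) -> float:
    """Average order effect over (score_ab, score_ba) tuples."""
    return mean([asymmetry(ab, ba) for ab, ba in pairs])
```

A mean difference near zero would indicate no order effect for that model and setting; whether a non-zero difference is significant would still need a statistical test, which this sketch leaves out.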

Abstract

Large Language Models (LLMs) have revolutionised the capability of AI models in comprehending and generating natural language text. They are increasingly being used to empower and deploy agents in real-world scenarios, which make decisions and take actions based on their understanding of the context. Therefore, researchers, policy makers and enterprises alike are working towards ensuring that the decisions made by these agents align with human values and user expectations. That being said, human values and decisions are not always straightforward to measure and are subject to different cognitive biases. A large body of literature in Behavioural Science studies biases in human judgements. In this work we report an ongoing investigation into the alignment of LLMs with human judgements affected by order bias. Specifically, we focus on a famous human study which showed evidence of order effects in similarity judgements, and replicate it with various popular LLMs. We report the different settings in which LLMs exhibit human-like order effect bias and discuss the implications of these findings for the design and development of LLM-based applications.

Paper Structure

This paper contains 10 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Distribution of similarity differences for all countries and all models for single prompt settings
  • Figure 2: Comparing direction and magnitude of similarity scores for each country pair for aligned LLMs and human data (Prompt style SSD)
  • Figure 3: Distribution of similarity differences for all countries and all models for dual prompt settings