Mind Scramble: Unveiling Large Language Model Psychology Via Typoglycemia

Miao Yu; Junyuan Mao; Guibin Zhang; Jingheng Ye; Junfeng Fang; Aoxiao Zhong; Yang Liu; Yuxuan Liang; Kun Wang; Qingsong Wen

TL;DR

This paper introduces a research line and methodology called LLM Psychology, which leverages human psychology experiments to investigate the cognitive behaviors and mechanisms of large language models, and migrates the Typoglycemia phenomenon from psychology to explore the "mind" of LLMs.

Abstract

Research into the external behaviors and internal mechanisms of large language models (LLMs) has shown promise in addressing complex tasks in the physical world. Studies suggest that powerful LLMs, like GPT-4, are beginning to exhibit human-like cognitive abilities, including planning, reasoning, and reflection. In this paper, we introduce a research line and methodology called LLM Psychology, leveraging human psychology experiments to investigate the cognitive behaviors and mechanisms of LLMs. We migrate the Typoglycemia phenomenon from psychology to explore the "mind" of LLMs. Unlike human brains, which rely on context and word patterns to comprehend scrambled text, LLMs use distinct encoding and decoding processes. Through Typoglycemia experiments at the character, word, and sentence levels, we observe: (I) LLMs demonstrate human-like behaviors on a macro scale, such as lower task accuracy and higher token/time consumption; (II) LLMs exhibit varying robustness to scrambled input, making Typoglycemia a benchmark for model evaluation without new datasets; (III) Different task types have varying impacts, with complex logical tasks (e.g., math) being more challenging in scrambled form; (IV) Each LLM has a unique and consistent "cognitive pattern" across tasks, revealing general mechanisms in its psychology process. We provide an in-depth analysis of hidden layers to explain these phenomena, paving the way for future research in LLM Psychology and deeper interpretability.
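
To make the character-level setting concrete, the following is a minimal sketch of the classic Typoglycemia transformation: the interior letters of each word are shuffled while the first and last letters stay fixed. The function names `scramble_word` and `scramble_text` and the `seed` parameter are illustrative assumptions, not the paper's own operators, which may differ in detail.

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the interior letters of a word, keeping first and last fixed."""
    # Words of three letters or fewer have at most one interior letter,
    # so shuffling cannot change them.
    if len(word) <= 3:
        return word
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]


def scramble_text(text, seed=0):
    """Apply scramble_word to every alphabetic run in the text.

    Punctuation, digits, and whitespace are left untouched.
    The seed is a hypothetical parameter added for reproducibility.
    """
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


if __name__ == "__main__":
    print(scramble_text("Typoglycemia makes scrambled sentences surprisingly readable."))
```

Because only letter order inside each word changes, the scrambled prompt keeps the same word count and the same multiset of characters per word, which is what lets a reader (or a model) still recover the intended word from context.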

Paper Structure (68 sections, 9 equations, 32 figures, 9 tables)

Introduction
Related Work
Can LLMs recognize Tpyogylcmeia as Typoglycemia?
Typoglycemia Pipeline (TypoPipe)
Typoglycemia Task = TypoC + TypoP
Why Completion and Perception?
Experiment
Experimental Setups
Main Results (RQ1)
Impact of Typoglycemia Functions (RQ2)
Scrambling ratio of Typoglycemia (RQ3)
Why do LLMs align with human performance (RQ4)
Conclusion
Experimental Discussion
Dataset Description
...and 53 more sections

Figures (32)

Figure 1: (Upper Left) The two-step process by which humans handle scrambled text. (Lower Left) Performance comparison among Human, Llama-3.1, and GPT-4 on the BoolQ dataset with both original and scrambled task descriptions. We observe a widespread phenomenon of maintaining high accuracy, along with counter-intuitive improvements in certain cases. (Right) The two-step process by which LLMs handle scrambled text, drawn in parallel with the human process. The scrambled text here consists of simple examples for better illustration. See the practical scrambled text cases used in our experiments in the appendix (TypoC case).
Figure 2: TypoBench Overview. TypoPipe and TypoTask form the two components of our benchmark. The overall pipeline consists of 4 steps: Calibration, Navigation, Refabrication, and Refinement. TypoTask consists of two task categories: TypoC and TypoP which emphasize performance and perception, respectively.
Figure 3: Token and time consumption ratio before and after being processed by TypoFunc when $\mathcal{F}_\Omega=$ REO-INT on character level for BoolQ dataset.
Figure 4: Line charts of accuracy for each model as the number of operations increases from 1 to 4 when $\mathcal{F}_\Omega=$ REO_INT, INS_INT, and DEL_INT at character level on the GSM8k dataset.
Figure 5: Decoder Perspective: The cosine similarity between the representations of normal text and the text processed by $\mathcal{F}_\Omega$ in the SQuAD and BoolQ for each layer of the Llama-3.1-8B model (which has 33 layers in total: 1 word embedding layer and 32 Transformer layers). BASE is the standard for similarity calculation.
...and 27 more figures
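
The layer-wise comparison described for Figure 5 can be sketched as follows. This is a minimal illustration that assumes each layer's representation has already been reduced to a single vector per input (the pooling choice is an assumption here, not something the caption specifies); `cosine` and `layerwise_cosine` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_cosine(base_layers, scrambled_layers):
    """Compare representations layer by layer.

    base_layers, scrambled_layers: one vector per layer, for the normal
    text and the scrambled text respectively (assumed already pooled).
    Returns one similarity value per layer.
    """
    return [cosine(u, v) for u, v in zip(base_layers, scrambled_layers)]


if __name__ == "__main__":
    # Toy vectors standing in for hidden states; not real model outputs.
    base = [[1.0, 0.0], [1.0, 1.0]]
    scrambled = [[1.0, 0.0], [1.0, -1.0]]
    print(layerwise_cosine(base, scrambled))
```

A value close to 1 at a given layer means the scrambled input is represented almost identically to the normal input there; lower values indicate that the scrambling still distorts the representation at that depth.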
