MulCogBench: A Multi-modal Cognitive Benchmark Dataset for Evaluating Chinese and English Computational Language Models
Yunhao Zhang, Xiaohan Zhang, Chong Li, Shaonan Wang, Chengqing Zong
TL;DR
MulCogBench addresses whether pre-trained language models align with human language processing by providing a large-scale, multi-modal cognitive dataset in Chinese and English and evaluating model-cognition similarity via similarity-encoding analysis (SEA). The approach combines eye-tracking, word-level and discourse-level fMRI, and MEG data with four representative language models (Word2Vec, GloVe, BERT/MacBERT, GPT-2) to examine how embeddings relate to brain and behavioral data across modalities and linguistic units. Findings show significant model-cognition similarities, with patterns modulated by modality and linguistic unit; context-aware models outperform context-independent ones as linguistic complexity increases, and shallow layers align with MEG while deeper layers align with fMRI, with high cross-language consistency between Chinese and English. The work provides a generalizable benchmark for probing cognitive plausibility and guiding brain-informed language model development, and releases rich data for future research.
Notation: in the SEA reconstruction formula $C' = (M - I_n)C$, $M$ denotes a representational similarity matrix and $I_n$ is the $n \times n$ identity matrix.
Abstract
Pre-trained computational language models have recently made remarkable progress in harnessing language abilities that were considered unique to humans. Their success has raised interest in whether these models represent and process language like humans. To answer this question, this paper proposes MulCogBench, a multi-modal cognitive benchmark dataset collected from native Chinese and English participants. It encompasses a variety of cognitive data, including subjective semantic ratings, eye-tracking, functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG). To assess the relationship between language models and cognitive data, we conducted a similarity-encoding analysis, which decodes cognitive data based on its pattern similarity with textual embeddings. Results show that language models share significant similarities with human cognitive data, and that the similarity patterns are modulated by data modality and stimulus complexity. Specifically, context-aware models outperform context-independent models as language stimulus complexity increases. The shallow layers of context-aware models are better aligned with the high-temporal-resolution MEG signals, whereas the deeper layers show more similarity with the high-spatial-resolution fMRI. These results indicate that language models have a nuanced relationship with brain language representations. Moreover, the results for Chinese and English are highly consistent, suggesting that these findings generalize across languages.
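To make the similarity-encoding idea concrete, the following is a minimal sketch of the reconstruction $C' = (M - I_n)C$ on synthetic data. It rests on several assumptions that are not stated in the text above: $M$ is taken to be the Pearson correlation between the model embeddings of each pair of stimuli, $C$ is an $n \times d$ matrix of cognitive responses, and the fit is scored by the per-stimulus Pearson correlation between $C'$ and $C$. The function names (`similarity_matrix`, `sea_reconstruct`, `rowwise_pearson`) are hypothetical, and the actual metric, normalization, and evaluation used for MulCogBench may differ.

```python
import numpy as np

def similarity_matrix(embeddings):
    """n x n Pearson correlation between the embedding rows (one row per stimulus).
    Assumption: correlation is the similarity measure; other choices are possible."""
    return np.corrcoef(embeddings)

def sea_reconstruct(embeddings, cognitive):
    """Reconstruct each stimulus's cognitive response from the other stimuli,
    weighted by model similarity: C' = (M - I_n) C."""
    M = similarity_matrix(embeddings)
    n = M.shape[0]
    # Subtracting I_n zeroes the diagonal, so a stimulus never predicts itself.
    return (M - np.eye(n)) @ cognitive

def rowwise_pearson(A, B):
    """Pearson correlation between matching rows of A and B."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

# Synthetic example: embeddings and "cognitive" data share a latent structure.
rng = np.random.default_rng(0)
n_stimuli, n_latent = 20, 5
latent = rng.normal(size=(n_stimuli, n_latent))
embeddings = latent @ rng.normal(size=(n_latent, 50))   # stand-in for model vectors
cognitive = latent @ rng.normal(size=(n_latent, 30))    # stand-in for voxel/sensor data
cognitive += 0.5 * rng.normal(size=cognitive.shape)     # measurement noise

reconstructed = sea_reconstruct(embeddings, cognitive)
scores = rowwise_pearson(reconstructed, cognitive)
print("mean reconstruction correlation:", scores.mean())
```

Because the diagonal of $M - I_n$ is zero, the reconstruction of stimulus $i$ depends only on the cognitive data of the other stimuli, which is what lets the correlation between $C'$ and $C$ serve as a measure of model-cognition similarity rather than a trivial self-match.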
