Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

Luzhou Peng, Zhengxin Yang, Honglu Ji, Yikang Yang, Fanda Fan, Wanling Gao, Jiayuan Ge, Yilin Han, Jianfeng Zhan

TL;DR

This paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior, and centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits.

Abstract

Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy. Both simplifications ignore the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.

Paper Structure (74 sections, 39 equations, 24 figures, 21 tables)

Figures (24)

  • Figure 1: High-risk items' failure correlates with wider errors across the dataset. Rows are the 5 highest-risk items in (a) and the 5 lowest-risk items in (b); columns are all MATH-500 items. For a row item $i$ and a column item $k$, the color shows how much the failure probability of item $k$ rises when a model fails item $i$.
  • Figure 2: A surprising case across LLMs on MATH-500. Kimi-k2, despite higher overall accuracy, fails on this item, whereas lower-accuracy LLMs (GPT-4.1-nano, GLM-4.5-air) succeed.
  • Figure 3: Overview of the Probing Memes Paradigm. Starting from the Perception Matrix, the paradigm computes diverse item-level properties to construct probes, which are then used to detect models’ memes, providing an interpretable view of fine-grained behavioral structure and underlying capabilities.
  • Figure 4: Probe clusters reveal distinct population behavioral patterns (Curated Population). Within each cluster, rows represent items and their corresponding perception spans, which record the error of every model. Two representative clusters are shown: (a) A cluster where all base-prompted variants fail while stronger reasoning modes succeed, suggesting that explicit reasoning improves reliability on this type of item. (b) A cluster where gpt-family models fail consistently despite high accuracy for many other models.
  • Figure 5: A 3D landscape of datasets through the lens of Probe Properties. Six Probe Properties are visualized via 3D position (axes), color, marker size, and marker shape, using dataset-level averages over all probes.
  • ...and 19 more figures
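
The quantity described in the Figure 1 caption can be illustrated with a small sketch. This is one plausible reading, not the paper's definition: it assumes the Perception Matrix can be represented as a binary models-by-items error matrix (1 = the model fails the item), and it takes the "rise" in failure probability to be the difference P(fail k | fail i) - P(fail k). The function name `failure_lift` and the toy data are hypothetical.

```python
def failure_lift(errors):
    """Estimate how much failing item i raises the failure rate of item k.

    errors[m][i] is 1 if model m fails item i, else 0 (an assumed encoding).
    Returns lift[i][k] = P(fail k | fail i) - P(fail k). A row is None when
    no model fails item i, because the conditional is then undefined.
    """
    n_models = len(errors)
    n_items = len(errors[0])
    # Baseline failure rate of every item over the whole model population.
    base = [sum(row[k] for row in errors) / n_models for k in range(n_items)]
    lift = []
    for i in range(n_items):
        # Restrict the population to models that fail item i.
        failed_i = [row for row in errors if row[i] == 1]
        if not failed_i:
            lift.append(None)
            continue
        cond = [sum(row[k] for row in failed_i) / len(failed_i)
                for k in range(n_items)]
        lift.append([cond[k] - base[k] for k in range(n_items)])
    return lift


# Toy data: 4 hypothetical models x 3 items (1 = failure).
errors = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
lift = failure_lift(errors)
for i, row in enumerate(lift):
    print(f"item {i}: {row}")
```

In this toy population, items 0 and 1 always fail together, so failing one raises the other's failure rate by 0.5, while item 2 is unaffected by either. A heatmap of such a matrix, with rows restricted to selected items, would have the same layout as the one the caption describes.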