Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

TL;DR

This work reframes contamination of multiple-choice question (MCQ) benchmarks as an inherent aspect of learning and disentangles rote memorization from genuine capability in LLM evaluation. It introduces TrinEval, a knowledge-centric reformulation that uses a knowledge entity–attribute–context trinity to preserve the essential knowledge being tested while suppressing memorization, and defines the corresponding metrics $F_m$ and $F_c$ to quantify memorization and capability, respectively. Empirical results on MMLU suggest that common LLMs memorize by rote about 20.5% of knowledge points on average, and that LLMs perform worse on memorized MCQs than on non-memorized ones. TrinEval reduces the memorization effect and so provides a more reliable gauge of true understanding. The findings challenge the assumption that memorization enhances performance and suggest directions for improving knowledge robustness and evaluation methodologies.

Abstract

Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework reformulating MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: MCQ-based evaluation. We observe that LLMs tend to underperform on memorized MCQs.
  • Figure 2: Model performance on memorized and non-memorized subsets of MMLU, where '0s' and '5s' stand for zero- and five-shot prompting, respectively.
  • Figure 3: Knowledge-preserving validation of TrinEval reformulation. The green, blue, and overlapping regions represent the sets of MCQs correctly answered in the original format, TrinEval format, and both formats, respectively. Best viewed in color.
  • Figure 4: The results of memorization evocation (evoc.) under various dataset-related contexts, with green and blue curves referring to the memorization difference $\Delta F_{m}$ in the original and TrinEval formats, respectively. On the x-axis, 'clean', 'meta', 'dev-fsp', and 'seq-fsp' stand for no dataset-related context, the name of the dataset, a few-shot prompt from the development set, and a few-shot prompt from the test set placed ahead of the testing question, respectively. In general, these $\Delta F_{m}$ results indicate that the memorization effect grows as more dataset-related information is provided. However, the $\Delta F_{m}$ under TrinEval in the strongest memory-evocation context remains consistently lower than that of the original format (see, e.g., the red dashed line).
  • Figure 5: The distribution of MCQs based on memorization metric $F_m$ vs. capability metric $F_c$. According to the values of $F_m$ and $F_c$, we separate the MCQs into 25 groups and visualize the MCQ distribution from weak to strong with heatmaps.
  • ...and 2 more figures
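
The grouping described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes (these details are not stated in this summary) that each MCQ has an $F_m$ and an $F_c$ score normalized to $[0, 1]$, and that the 25 groups form a 5 × 5 grid of equal-width bins. The function names `bin_index` and `group_counts` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bin_index(value: float, n_bins: int = 5) -> int:
    """Map a score in [0, 1] to an equal-width bin index in 0..n_bins-1.

    Assumption: scores are normalized to [0, 1]; the paper's actual
    binning rule may differ.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score {value} is outside [0, 1]")
    # A score of exactly 1.0 falls into the last bin.
    return min(int(value * n_bins), n_bins - 1)


def group_counts(
    scores: Iterable[Tuple[float, float]], n_bins: int = 5
) -> List[List[int]]:
    """Count MCQs per (F_m bin, F_c bin) cell.

    `scores` holds one hypothetical (F_m, F_c) pair per MCQ. The result is
    an n_bins x n_bins grid (25 cells by default) whose rows run from weak
    to strong memorization and whose columns run from weak to strong
    capability, ready to be drawn as a heatmap.
    """
    counts = Counter(
        (bin_index(f_m, n_bins), bin_index(f_c, n_bins)) for f_m, f_c in scores
    )
    return [[counts[(i, j)] for j in range(n_bins)] for i in range(n_bins)]


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not results from the paper.
    example = [(0.0, 0.0), (0.1, 0.95), (0.5, 0.5), (1.0, 1.0)]
    for row in group_counts(example):
        print(row)
```

Each row of the returned grid corresponds to one $F_m$ bin and each column to one $F_c$ bin, so the cell counts can be passed directly to any heatmap plotting routine.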