Large Language Models Could Be Rote Learners
Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
TL;DR
This work reframes MCQ benchmark contamination as an inherent aspect of learning and disentangles rote memorization from genuine capability in LLM evaluation. It introduces TrinEval, a knowledge-centric reformulation that uses a knowledge entity–attribute–context trinity to preserve essential knowledge while suppressing memorization, and defines corresponding metrics $F_m$ and $F_c$ to quantify memorization and capability, respectively. Empirical results on MMLU suggest that common LLMs memorize by rote roughly 20.5% of knowledge points on average, and that TrinEval reduces the influence of memorization, giving a more reliable gauge of true understanding. The findings challenge the assumption that memorization enhances performance and suggest directions for improving knowledge robustness and evaluation methodologies.
Abstract
Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points on average in MMLU.
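The comparison between memorized and non-memorized MCQs described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the definitions of $F_m$ or $F_c$: the `MCQResult` record, the `memorized` flag (assumed here to come from some external memorization test), and the toy data are all hypothetical and serve only to show how accuracy could be split by memorization condition.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MCQResult:
    """One evaluated question (hypothetical record, not from the paper)."""
    question_id: str
    memorized: bool  # assumed output of some memorization test
    correct: bool    # whether the model answered the MCQ correctly


def accuracy_by_memorization(results: List[MCQResult]) -> Tuple[float, float, float]:
    """Return (accuracy on memorized items, accuracy on non-memorized items,
    share of items flagged as memorized)."""
    memorized = [r for r in results if r.memorized]
    non_memorized = [r for r in results if not r.memorized]

    def accuracy(group: List[MCQResult]) -> float:
        # An empty group has no defined accuracy, so report NaN.
        return sum(r.correct for r in group) / len(group) if group else float("nan")

    share = len(memorized) / len(results) if results else float("nan")
    return accuracy(memorized), accuracy(non_memorized), share


# Toy data, invented purely for illustration (not results from the paper).
toy = [
    MCQResult("q1", memorized=True, correct=True),
    MCQResult("q2", memorized=True, correct=False),
    MCQResult("q3", memorized=False, correct=True),
    MCQResult("q4", memorized=False, correct=True),
    MCQResult("q5", memorized=False, correct=False),
]

acc_mem, acc_non, mem_share = accuracy_by_memorization(toy)
print(f"memorized: {acc_mem:.2f}, non-memorized: {acc_non:.2f}, share: {mem_share:.2f}")
```

With real evaluation data, a lower accuracy on the memorized subset than on the non-memorized subset would correspond to the counterintuitive trend the abstract reports.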
