Are Large Language Models Truly Smarter Than Humans?

Eshwar Reddy M, Sourav Karmakar

Abstract

Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public and their questions are widely mirrored across the internet, creating a systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments that together form a multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding that accuracy drops by an average of 7.0 percentage points (pp) under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

Paper Structure (38 sections, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Experiment 2 results (all six models). Left: Accuracy by question form and model. Blue = original, orange = paraphrased, red = indirect reference. Most models show accuracy degradation as surface form diverges from training text; DeepSeek-R1 is a notable anomaly (low baseline, minimal drop) explained by Experiment 3. Right: Average accuracy drop (original $\to$ indirect) by subject across all six models. Law ($-0.20$) and Ethics ($-0.20$) show the largest drops, directly corresponding to the highest Experiment 1 contamination rates in the Professional and Humanities categories.
  • Figure 2: Experiment 3 results. Left: TS-Guessing contamination rates by model. Red = OM partial ($\geq$50% overlap); blue = WM exact; green = combined. All models far exceed the 5% random baseline (dashed). DeepSeek-R1 shows the highest OM-partial rate (76.6%) with zero WM-exact recall, the distributed memorization signature. Centre: OM partial contamination rate by model $\times$ MMLU category (heatmap). STEM is consistently the most contaminated category across all models; DeepSeek-R1 reaches 86% in STEM. Right: Contamination rate by category comparing Experiment 1 (web-search lexical, blue) vs. Experiment 3 (TS-Guessing behavioral, red). Both methods independently rank STEM highest, providing convergent multi-method evidence.
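
The distinction in the Figure 2 caption between exact recall and partial reconstruction at a $\geq$50% overlap threshold can be illustrated with a small scoring function. This is only a minimal sketch under stated assumptions: the tokenizer, the overlap measure (fraction of reference tokens recovered), and the names `tokenize`, `overlap_fraction`, and `classify_probe` are hypothetical and are not taken from the paper, which may compute overlap differently.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_fraction(guess, reference):
    """Fraction of reference tokens that also appear in the model's guess."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 0.0
    guess_tokens = set(tokenize(guess))
    hits = sum(1 for tok in ref_tokens if tok in guess_tokens)
    return hits / len(ref_tokens)


def classify_probe(guess, reference, threshold=0.5):
    """Label one probe response as 'exact', 'partial', or 'none'.

    'exact'   : guess matches the masked reference text token for token.
    'partial' : at least `threshold` of the reference tokens are recovered
                (0.5 mirrors the >=50% overlap in the caption).
    'none'    : anything else, including an empty reference.
    """
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return "none"
    if tokenize(guess) == ref_tokens:
        return "exact"
    if overlap_fraction(guess, reference) >= threshold:
        return "partial"
    return "none"


def signal_rate(labels, kind):
    """Share of probes carrying a given label, e.g. the partial rate."""
    return sum(1 for label in labels if label == kind) / len(labels) if labels else 0.0
```

Under this sketch, a model with a high `signal_rate(labels, "partial")` and a zero `signal_rate(labels, "exact")` would show the pattern the caption attributes to DeepSeek-R1: frequent partial reconstruction without verbatim recall.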