CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

TL;DR

CodeMMLU introduces a large-scale MCQ benchmark to assess code understanding and reasoning in CodeLLMs, addressing limitations of generation-focused benchmarks. It consists of nearly 20,000 questions across 52 topics and 10+ programming languages, organized into knowledge-based tests and fundamental coding tasks. The study shows that despite strong performance on knowledge tests, many models struggle with execution and real-world coding tasks, and that prompting strategies like CoT can hurt performance. The results highlight a need for robust, bias-aware evaluation and provide guidance for building more reliable AI-assisted coding tools.

Abstract

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains, including code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks such as code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension beyond generation. By emphasizing the essential connection between code understanding and effective AI-assisted development, CodeMMLU provides a critical resource for advancing more reliable and capable coding assistants.
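
Because the benchmark is multiple-choice, a model's result on each task reduces to the share of questions where the chosen option matches the answer key. The sketch below illustrates that per-task accuracy computation. It is not the authors' evaluation harness: the function name `accuracy_by_task`, the record layout, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return accuracy (%) per task.

    `records` is an iterable of (task, predicted_option, gold_option)
    tuples. This layout is hypothetical, not the benchmark's data format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


# Toy records for illustration only; these are not benchmark results.
sample = [
    ("code_repair", "A", "A"),
    ("code_repair", "b", "C"),
    ("execution_reasoning", "d", "D"),
    ("execution_reasoning", "B", "B"),
]
scores = accuracy_by_task(sample)
```

Reporting one number per task, rather than a single pooled accuracy, is what allows the comparison between knowledge tests and coding tasks described above.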

Paper Structure (40 sections, 2 equations, 17 figures, 8 tables)

This paper contains 40 sections, 2 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Summary performance of LLMs on the CodeMMLU benchmark. This radar chart presents the evaluation results (accuracy %) of different models across various CodeMMLU tasks.
  • Figure 2: Overview of the CodeMMLU data creation pipeline. The blue area describes the process of collecting raw multiple-choice questions (MCQs) from open-source internet resources for the knowledge test set, while the orange area shows the pipeline for real-world problems.
  • Figure 3: Comparison of prompt configurations on GPT-4o. The experiment exposes the drawback of the Chain-of-Thought prompting technique in terms of boosting performance on tasks that do not require logic or reasoning.
  • Figure 4: CodeMMLU accuracy by task on LLMs. While performance on knowledge tasks follows the scaling law, real-world tasks are more challenging for LLMs, which points to the role of instruction tuning and data quality in CodeMMLU results.
  • Figure 5: Correlation between knowledge tests and fundamental skill tests. Experiments on 10 LLM families show a clear alignment between models with a strong understanding of software knowledge and their performance on diverse problem-solving tasks in the CodeMMLU fundamental skill tests.
  • ...and 12 more figures