Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis

Adrian Marius Dumitran, Adrian Catalin Badea, Stefan-Gabriel Muscalu

TL;DR

This study systematically evaluates large language models on a two-decade dataset of Romanian county-level informatics olympiad (OJI) problems, focusing on C++ and Python solutions to uncover how problem difficulty, language, and prompting influence LLM performance. Using a standardized, multi-attempt evaluation with feedback loops across closed-source and open-weight models, the authors quantify cross-model strengths and limitations, finding that C++ generally yields better competitive programming results than Python and that GPT-4 offers strong educational potential for lower grades. Key findings include substantial grade-by-grade variation, notable language-specific differences in code quality, and the emergence of superior open-weight models, alongside practical insights for educational use and contest design. The work provides a robust, multilingual benchmark and a roadmap for future enhancements (tagging, difficulty assessment, English translation, and human–LLM collaboration) to advance LLM-assisted competitive programming and learning outcomes.

Abstract

This study explores the performance of large language models (LLMs) in solving competitive programming problems from the Romanian Informatics Olympiad at the county level. Romania, a leading nation in computer science competitions, provides an ideal environment for evaluating LLM capabilities due to its rich history and stringent competition standards. We collected and analyzed a dataset comprising 304 challenges from 2002 to 2023, focusing on solutions written by LLMs in C++ and Python for these problems. Our primary goal is to understand why LLMs perform well or poorly on different tasks. We evaluated various models, including closed-source models like GPT-4 and open-weight models such as CodeLlama and RoMistral, using a standardized process involving multiple attempts and feedback rounds. The analysis revealed significant variations in LLM performance across different grades and problem types. Notably, GPT-4 showed strong performance, indicating its potential use as an educational tool for middle school students. We also observed differences in code quality and style across various LLMs.
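The "multiple attempts and feedback rounds" process mentioned in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the names `evaluate_problem`, `generate`, `judge`, and `AttemptResult`, the default of three attempts, and the 100-point full score are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AttemptResult:
    score: int      # points awarded by the judge (assumed scale: 0..100)
    feedback: str   # judge message handed back to the model on the next round


def evaluate_problem(
    statement: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    judge: Callable[[str], AttemptResult],          # hypothetical test runner
    max_attempts: int = 3,                          # assumed attempt budget
) -> List[AttemptResult]:
    """Run up to max_attempts rounds, passing judge feedback into the next prompt."""
    history: List[AttemptResult] = []
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(statement, feedback)
        result = judge(code)
        history.append(result)
        if result.score == 100:  # stop early once the solution gets full marks
            break
        feedback = result.feedback
    return history


# Stub model and judge, used only to exercise the loop.
def stub_generate(statement: str, feedback: Optional[str]) -> str:
    return "v1" if feedback is None else "v2"


def stub_judge(code: str) -> AttemptResult:
    if code == "v2":
        return AttemptResult(100, "all tests passed")
    return AttemptResult(40, "wrong answer on some tests")


history = evaluate_problem("sum two numbers", stub_generate, stub_judge)
best_score = max(r.score for r in history)
```

Keeping the model call and the judge as injected callables means the same loop can be reused for any model or target language; only the two wrappers would change.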

Paper Structure (29 sections, 8 figures, 1 table)

Figures (8)

  • Figure 1: Evaluation Flowchart
  • Figure 2: Total Scores in Grade 5 per Model
  • Figure 3: Code Size
  • Figure 4: Code Length per LLM
  • Figure 5: Trends in OJI Problem Difficulty Over 20 Years
  • ...and 3 more figures