Assessing the Capability of LLMs in Solving POSCOMP Questions
Cayo Viegas, Rohit Gheyi, Márcio Ribeiro
TL;DR
This study evaluates the capability of large language models to solve the Brazilian POSCOMP graduate admissions exam, benchmarking ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral on the 2022 and 2023 exams, with additional frontier models (e.g., Gemini 2.5 Pro Experimental, o1, o3-mini-high) evaluated on the 2022–2024 exams. It finds that text-based questions are more tractable than image-based ones, with ChatGPT-4 leading among the earlier models and frontier models achieving 90%+ accuracy across several topics, often surpassing both average and top human performers. The work combines zero-shot prompting, translation to English, and metamorphic testing to assess robustness, and extends to image-containing prompts and later PDF-based evaluation. The results underscore the rapid advancement of domain-specific capabilities in LLMs. The authors discuss practical implications for education, exam design, and assessment analytics, acknowledge threats to validity, and outline future directions for broader model coverage and prompting strategies.
Abstract
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. POSCOMP, a prestigious Brazilian examination promoted by the Brazilian Computer Society (SBC) and used for graduate admissions in computer science, provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs (ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large) were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam, where ChatGPT-4 again achieved the highest performance and surpassed all students who took the exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models (o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high), evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
