OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

Zhen Huang, Zengzhi Wang, Shijie Xia, Pengfei Liu

TL;DR

The paper introduces OlympicArena, a comprehensive multi-discipline benchmark, together with a novel Olympic-style Medal Table ranking for comparing recent AI models across subjects, modalities, languages, and reasoning types. Using a strict test split and rule-based evaluation, it analyzes proprietary models (GPT-4o, GPT-4V, Claude-3.5-Sonnet, Gemini-1.5-Pro) alongside open-source competitors, finding that GPT-4o generally leads in math and computer science while Claude-3.5-Sonnet and Gemini-1.5-Pro excel in knowledge-intensive domains such as physics, chemistry, and biology. The study also reveals modality and language gaps: top models perform better on English than on Chinese tasks, and better on text-only than on multi-modal tasks, pointing to room for improvement in multi-modal reasoning and multilingual capability. Overall, the OlympicArena Medal Table offers a transparent, discipline-aware framework for tracking progress toward more capable AI and for guiding future data coverage and training strategies.
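
To make "rule-based evaluation" concrete: it plausibly denotes programmatic answer checking rather than model-graded scoring. The minimal Python sketch below is an illustrative assumption, not the benchmark's released evaluator; the normalize and is_correct helpers are hypothetical. It normalizes a model's final answer and compares it against the gold answer, with a numeric tolerance for math-style responses.

    import re

    def normalize(ans: str) -> str:
        # Hypothetical normalizer: lowercase, trim, and strip
        # wrappers such as "The answer is: (C)."
        ans = ans.strip().lower()
        ans = re.sub(r"^(the\s+)?answer\s*(is)?\s*[:\-]?\s*", "", ans)
        return ans.strip(" .()")

    def is_correct(prediction: str, gold: str, tol: float = 1e-6) -> bool:
        p, g = normalize(prediction), normalize(gold)
        try:
            # Compare numerically with a tolerance when both parse as floats.
            return abs(float(p) - float(g)) <= tol
        except ValueError:
            # Otherwise fall back to exact string match.
            return p == g

    print(is_correct("The answer is: 3.14159", "3.14159"))  # True
    print(is_correct("(C)", "c"))                           # True

Rules like these keep scoring cheap and reproducible, at the cost of occasionally rejecting correct answers phrased in unexpected ways.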

Abstract

In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic Medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance compared to GPT-4o, even surpassing GPT-4o on a few subjects (namely Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
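
As an illustration of how an Olympic-style Medal Table can rank models, the following minimal Python sketch awards gold, silver, and bronze to the top three models in each discipline and then orders models lexicographically by medal counts, as national medal tables do. This is a sketch under stated assumptions, not the paper's released code, and the per-discipline scores are hypothetical placeholders rather than reported results.

    from collections import Counter

    # Hypothetical placeholder scores per discipline (NOT the paper's numbers).
    scores = {
        "Math":    {"Model-A": 45.0, "Model-B": 43.0, "Model-C": 40.0},
        "Physics": {"Model-A": 30.0, "Model-B": 34.0, "Model-C": 31.0},
        "CS":      {"Model-A": 50.0, "Model-B": 48.0, "Model-C": 41.0},
    }

    def medal_table(scores):
        """Award gold/silver/bronze per discipline, then rank models
        lexicographically by (gold, silver, bronze) counts."""
        medals = {m: Counter() for results in scores.values() for m in results}
        for discipline, results in scores.items():
            ranked = sorted(results, key=results.get, reverse=True)
            for medal, model in zip(("gold", "silver", "bronze"), ranked):
                medals[model][medal] += 1
        return sorted(
            medals.items(),
            key=lambda kv: (kv[1]["gold"], kv[1]["silver"], kv[1]["bronze"]),
            reverse=True,
        )

    for model, m in medal_table(scores):
        print(f"{model}: {m['gold']} gold, {m['silver']} silver, {m['bronze']} bronze")

Lexicographic ordering (golds first, ties broken by silvers, then bronzes) mirrors the convention most Olympic medal tables follow; a weighted point total is a common alternative when finer-grained differences matter.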
