Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education

Vahid Ashrafimoghari, Necdet Gürkan, Jordan W. Suchow

TL;DR

The paper benchmarks seven major LLMs on official GMAT practice exams to evaluate their potential in business education and exam preparation. GPT-4 Turbo consistently achieves the highest performance across verbal and quantitative domains. The latest LLMs approach human-like performance patterns, and a case study demonstrates their tutoring capabilities. The paper analyzes section-level strengths, model-versus-human performance, and common error categories, and discusses limitations, the risk of misinformation, and broader social and ethical implications. It argues for careful, framework-guided deployment of AI in education, emphasizing accuracy verification, equitable access, and alignment with human expertise to responsibly harness AI's educational benefits.

Abstract

The rapid evolution of artificial intelligence (AI), especially in the domain of Large Language Models (LLMs) and generative AI, has opened new avenues for application across various fields, yet its role in business education remains underexplored. This study introduces the first benchmark to assess the performance of seven major LLMs, OpenAI's models (GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo), Google's models (PaLM 2, Gemini 1.0 Pro), and Anthropic's models (Claude 2 and Claude 2.1), on the GMAT, which is a key exam in the admission process for graduate business programs. Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools. Through a case study, this research examines GPT-4 Turbo's ability to explain answers, evaluate responses, identify errors, tailor instructions, and generate alternative scenarios. The latest LLM versions, GPT-4 Turbo, Claude 2.1, and Gemini 1.0 Pro, show marked improvements in reasoning tasks compared to their predecessors, underscoring their potential for complex problem-solving. While AI's promise in education, assessment, and tutoring is clear, challenges remain. Our study not only sheds light on LLMs' academic potential but also emphasizes the need for careful development and application of AI in education. As AI technology advances, it is imperative to establish frameworks and protocols for AI interaction, verify the accuracy of AI-generated content, ensure worldwide access for diverse learners, and create an educational environment where AI supports human expertise. This research sets the stage for further exploration into the responsible use of AI to enrich educational experiences and improve exam preparation and assessment methods.

Paper Structure

This paper contains 29 sections, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: The template employed for generating prompts for every multiple-choice question. Elements shown in double braces are substituted with question-specific values.
  • Figure 2: An example implementation of the template shown in Figure 1.
  • Figure 3: Original problem statement with image
  • Figure 4: The prompt describes the problem shown in Figure 3, presented without an accompanying image. The correct answer provided by GPT-3.5 Turbo is highlighted in grey.
  • Figure 5: This figure presents a comparative analysis of the average performance across seven LLMs and human candidates in quantitative reasoning, verbal reasoning, and total GMAT scores.
  • ...and 15 more figures
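
The caption of Figure 1 describes a prompt template in which elements in double braces are replaced with question-specific values. A minimal sketch of that substitution step follows. The template wording, the placeholder names (`question`, `options`), and the helper functions are all hypothetical, since the actual template text is not reproduced here.

```python
import re

# Hypothetical template: the paper's real wording and field names are not shown
# in this listing, so these placeholders are assumptions for illustration only.
TEMPLATE = (
    "Question: {{question}}\n"
    "Options:\n"
    "{{options}}\n"
    "Answer with a single letter."
)

# Matches {{name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} with values[name]; raise KeyError if one is missing."""
    def substitute(match: re.Match) -> str:
        return values[match.group(1)]

    # A function replacement avoids backslash-escape processing of the values.
    return PLACEHOLDER.sub(substitute, template)


def format_options(options: list[str]) -> str:
    """Label answer choices A, B, C, ... one per line."""
    return "\n".join(
        f"{chr(ord('A') + index)}. {text}" for index, text in enumerate(options)
    )


if __name__ == "__main__":
    prompt = fill_template(
        TEMPLATE,
        {
            "question": "What is 2 + 3?",
            "options": format_options(["4", "5", "6"]),
        },
    )
    print(prompt)
```

Raising `KeyError` on a missing value is a deliberate choice in this sketch: a prompt with an unfilled placeholder would silently change what the model is asked.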