Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education
Vahid Ashrafimoghari, Necdet Gürkan, Jordan W. Suchow
TL;DR
The paper benchmarks seven major LLMs on official GMAT practice exams to evaluate their potential in business education and exam preparation. GPT-4 Turbo consistently achieves the highest performance across the verbal and quantitative domains, and the latest LLMs approach human-like performance patterns; a case study demonstrates their tutoring capabilities. The paper provides a nuanced analysis of section-level strengths, model-versus-human performance, and common error categories, and it discusses limitations, the risk of misinformation, and broader social and ethical implications. The work argues for careful, framework-guided deployment of AI in education, emphasizing accuracy verification, equitable access, and alignment with human expertise to responsibly harness AI's educational benefits.
Abstract
The rapid evolution of artificial intelligence (AI), especially in the domain of Large Language Models (LLMs) and generative AI, has opened new avenues for application across various fields, yet its role in business education remains underexplored. This study introduces the first benchmark to assess the performance of seven major LLMs on the GMAT, a key exam in the admission process for graduate business programs: OpenAI's models (GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo), Google's models (PaLM 2 and Gemini 1.0 Pro), and Anthropic's models (Claude 2 and Claude 2.1). Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools. Through a case study, this research examines GPT-4 Turbo's ability to explain answers, evaluate responses, identify errors, tailor instructions, and generate alternative scenarios. The latest LLM versions (GPT-4 Turbo, Claude 2.1, and Gemini 1.0 Pro) show marked improvements in reasoning tasks compared to their predecessors, underscoring their potential for complex problem-solving. While AI's promise in education, assessment, and tutoring is clear, challenges remain. Our study not only sheds light on LLMs' academic potential but also emphasizes the need for careful development and application of AI in education. As AI technology advances, it is imperative to establish frameworks and protocols for AI interaction, verify the accuracy of AI-generated content, ensure worldwide access for diverse learners, and create an educational environment where AI supports human expertise. This research sets the stage for further exploration into the responsible use of AI to enrich educational experiences and improve exam preparation and assessment methods.
