Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

Luoxi Tang, Tharunya Sundar, Shuai Yang, Ankita Patra, Manohar Chippada, Giqi Zhao, Yi Li, Riteng Zhang, Tunan Zhao, Ting Yang, Yuqiao Meng, Weicheng Ma, Zhaohan Xi

TL;DR

The paper introduces EstBook, a multimodal benchmark that evaluates large language models on five English Standardized Tests with 10,576 questions across 29 types and multiple input modalities. It systematically probes problem-solving and inference using three prompting strategies (ICL, CoT, ToT) and a breakdown-analysis framework that isolates six reasoning steps. Findings show substantial variability in LLM performance across tasks and modalities, limited gains from sophisticated prompting, and relatively weak performance on multi-step, numeric, and multimodal reasoning, with inference time not reliably predicting correctness. The work contributes a realistic evaluation resource, an empirical study of leading LLMs on ESTs, and a granular diagnostic method to identify specific reasoning bottlenecks, informing the design of better educational AI tutors.

Abstract

AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.

Paper Structure

This paper contains 18 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Examples of multimodal questions included in EstBook: (a) a reading comprehension question (text) from IELTS, (b) a listening comprehension question (audio) from TOEFL, and (c), (d), and (e) GRE quantitative questions involving math symbols, tabular data, and images, respectively.
  • Figure 2: Illustrative breakdown examples for solving EST questions.
  • Figure 3: LLM performance across varying levels of question difficulty, using CoT due to its representativeness. We focus on GRE text completion tasks with 1-, 2-, and 3-blanks, as well as available medium- and hard-level quantitative problems.
  • Figure 4: Inference time (in seconds) for failed and successful cases. More results are in Figure \ref{fig:expt-time-2}.
  • Figure 5: Breakdown analysis across all included tasks I-VI (Section \ref{ssec:task}) on GPT-4V.
  • ...and 3 more figures