Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation
Yuan An
TL;DR
This study demonstrates that an AI-driven, multi-agent workflow can orchestrate end-to-end scientific tasks for large-scale MCQ generation and evaluation, shifting labor from artifact creation to specification, coordination, and governance. A single researcher coordinates multiple LLMs to extract data, construct a grounding corpus from open textbooks, generate MCQs under explicit constraints, and evaluate both baseline and generated items using a 24-criterion rubric applied by two independent LLM judges. Results show high average quality for generated MCQs but only partial equivalence with expert-vetted baselines, with persistent gaps in depth, cognitive engagement, and calibration despite strong surface metrics. The work argues for a new professional paradigm, AI research operations, that emphasizes constraint formalization, validation loops, provenance auditing, and governance to reliably scale high-volume, AI-assisted scientific production across domains.
Abstract
Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps concentrated in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as grammar fluency, clarity options, and no duplicates, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift. The researcher's work moved from "authoring items" toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including emerging "AI research operations" skills required for AI-empowered research pipelines.
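
To make the idea of criterion-level equivalence testing concrete, the following is a minimal sketch of how a count such as "k of 24 criteria equivalent" could be computed. It is an illustration under stated assumptions, not the procedure used in this study: it assumes a large-sample two one-sided tests (TOST) approach with a normal approximation, and the equivalence margin (0.25 score points), the significance level (0.05), and the function names `tost_equivalent` and `count_equivalent` are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_equivalent(baseline, generated, margin=0.25, alpha=0.05):
    """Large-sample TOST for equivalence of two means.

    Returns True when the difference in mean scores lies significantly
    inside (-margin, +margin). The margin and alpha defaults are
    illustrative assumptions, not values taken from the study.
    """
    diff = mean(generated) - mean(baseline)
    se = sqrt(variance(baseline) / len(baseline)
              + variance(generated) / len(generated))
    if se == 0:
        # No variation in either sample: compare the raw difference.
        return abs(diff) < margin
    z = NormalDist()
    p_lower = 1 - z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = z.cdf((diff - margin) / se)      # H0: diff >= +margin
    # Equivalence requires rejecting both one-sided null hypotheses.
    return max(p_lower, p_upper) < alpha


def count_equivalent(scores_baseline, scores_generated, **kwargs):
    """Count criteria judged equivalent.

    Both arguments map a criterion name to a list of per-item scores.
    """
    return sum(
        tost_equivalent(scores_baseline[c], scores_generated[c], **kwargs)
        for c in scores_baseline
    )


# Toy data only; these numbers are made up for illustration.
baseline = {"grammar fluency": [4, 5] * 50, "skill depth": [4, 5] * 50}
generated = {"grammar fluency": [4, 5] * 50, "skill depth": [3, 4] * 50}
n_equivalent = count_equivalent(baseline, generated)
```

Under this sketch, strict similarity would correspond to `count_equivalent` returning the total number of criteria. High average scores alone would not imply that outcome, because a single criterion with a shifted mean is enough to fail it.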
