Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification

Tom Lee, Sihoon Lee, Seonghun Kim

TL;DR

This paper addresses the challenge of assessing authentic student understanding in the era of AI-generated writing, where LLMs threaten traditional open-ended assessments. It proposes a two-stage Human-AI Collaboration framework that pairs rubric-based automated scoring with AI-generated, targeted follow-up questions to obtain process evidence and verify reasoning. A pilot with nine university instructors shows that automatic scoring offers procedural fairness and consistency, while interactive verification is essential for construct validity; instructors also highlight the need for adaptive question difficulty. The work provides a scalable pathway for authentic assessment that treats AI as a synergistic partner in evaluation, rather than a policing threat, with implications for fairness, validity, and classroom practicality.

Abstract

Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. While recent research has focused on the accuracy of automated scoring (AES), these static approaches fail to capture process evidence or verify genuine student understanding. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions. In a pilot study with university instructors (N=9), we demonstrate that while Stage 1 (Auto-Scoring) ensures procedural fairness and consistency, Stage 2 (Interactive Verification) is essential for construct validity, effectively diagnosing superficial reasoning or unverified AI use. We report on the system design, instructor perceptions of fairness versus validity, and the necessity of adaptive difficulty in follow-up questioning. The findings offer a scalable pathway for authentic assessment that moves beyond policing AI to integrating it as a synergistic partner in the evaluation process.

Paper Structure

This paper contains 20 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Four-stage system overview: Rubric generation $\rightarrow$ Auto-scoring (initial response) $\rightarrow$ Follow-up questions $\rightarrow$ Reassessment (follow-up response).
  • Figure 2: Stage 1a inputs and outputs. (a) Input prompt (abstracted). (b) Generated output object example (subset; translated; illustrative).
  • Figure 3: Stage 2b reassessment artifacts derived from the pilot prototype. Panel (a) abstracts the system prompt guiding final scoring. Panel (b) shows a subset of score adjustments and the corresponding rationale communicated to the learner.
  • Figure 4: Pilot study procedure screenshots.