VerAs: Verify then Assess STEM Lab Reports

Berk Atil; Mahsa Sheikhi Karizaki; Rebecca J. Passonneau

TL;DR

VerAs tackles automated rubric-based assessment of long-form STEM writing by introducing a two-module OpenQA-inspired architecture that first verifies sentence relevance to a rubric dimension and then grades the selected content. Using dual encoders and an ordinal loss, VerAs assigns a score from 0 to 5 for each rubric dimension, demonstrated on college physics lab reports and middle-school essays. The approach outperforms strong baselines on total and per-dimension metrics, with ablations confirming the verifier's value and showing that the method can generalize to different rubric structures. This work enables scalable, formative feedback in STEM education and points to future enhancements with broader domain coverage and integration with large language models.

Abstract

With an increasing focus in STEM education on critical thinking skills, science writing plays an ever more important role in curricula that stress inquiry skills. A recently published dataset of two sets of college-level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics that utilize multiple dimensions, specifying subject matter knowledge and general components of good explanations. Each analytic dimension is assessed on a 6-point scale to provide detailed feedback that can help students improve their science writing skills. Manual assessment can be slow and difficult to calibrate for consistency across all students in large classes. While much work exists on automated assessment of open-ended questions in STEM subjects, there has been far less work on long-form writing such as lab reports. We present an end-to-end neural architecture that has separate verifier and assessment modules, inspired by approaches to Open Domain Question Answering (OpenQA). VerAs first verifies whether a report contains any content relevant to a given rubric dimension and, if so, assesses the relevant sentences. On the lab reports, VerAs outperforms multiple baselines based on OpenQA systems or Automated Essay Scoring (AES). VerAs also performs well on an analytic rubric for middle-school physics essays.
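The verify-then-assess control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `verify_then_assess`, the dot-product relevance score, the `threshold` cutoff, and the pluggable `grade_fn` are all assumptions made for illustration, and the toy vectors stand in for encoder outputs.

```python
from typing import Callable, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score."""
    return sum(a * b for a, b in zip(u, v))


def verify_then_assess(
    sentence_vecs: Sequence[Sequence[float]],
    dimension_vec: Sequence[float],
    grade_fn: Callable[[List[int]], int],
    k: int = 3,
    threshold: float = 0.0,  # hypothetical cutoff, not taken from the paper
) -> Tuple[int, List[int]]:
    """Return (score, selected sentence indices) for one rubric dimension."""
    # Verify: rank sentences by relevance to the rubric dimension.
    ranked = sorted(
        ((dot(vec, dimension_vec), idx) for idx, vec in enumerate(sentence_vecs)),
        reverse=True,
    )
    top = [(score, idx) for score, idx in ranked[:k] if score > threshold]
    if not top:
        # No relevant content found: the dimension gets a score of 0.
        return 0, []
    # Assess: pass the selected sentences (in document order) to the grader.
    selected = sorted(idx for _, idx in top)
    return grade_fn(selected), selected


# Toy usage: four sentence vectors and a grader that counts evidence.
sentences = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
result = verify_then_assess(sentences, [1.0, 0.0], lambda sel: min(5, len(sel)), k=2)
```

In this sketch the verifier and grader are decoupled, so the grader only ever sees sentences that passed the relevance check.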

Paper Structure (16 sections, 5 equations, 3 figures, 5 tables)

Introduction
Related Work
Datasets
VerAs Task and Architecture
Verifier
Grader
Experiments
Baselines
Ablations
Results
Evaluation Metrics
Results by Total Score and by Dimension
Error Analysis of the Verifier's Binary Decision
Results on Middle School Essays
Conclusion
...and 1 more sections

Figures (3)

Figure 1: A rubric dimension from each of two lab reports, with different scoring strategies.
Figure 2: For both lab reports, score distribution per dimension is highly skewed towards low or high scores, depending on the dimension difficulty, as in (a). The skew is less apparent when scores are aggregated across dimensions, as in (b).
Figure 3: VerAs architecture. The verifier uses a dual encoder to score each report sentence ($S_i$) against a rubric dimension ($D_m$) and forwards the top $k$ sentences to the grader; it is trained with a weighted binary cross-entropy loss on whether the report receives a non-zero score. The grader also uses a dual encoder: it concatenates the top $k$ sentences, $D_m$, and the full report $Rep_j$, and is trained with an ordinal log loss to assign a score.
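To make the grader's training objective more concrete, the sketch below shows one published formulation of an ordinal log loss, in which the probability placed on each wrong score is penalized in proportion to its distance from the true score. The caption only names the loss, so the exact form used by VerAs is an assumption here, as are the function name `ordinal_log_loss` and the exponent parameter `alpha`.

```python
import math
from typing import Sequence


def ordinal_log_loss(probs: Sequence[float], true_score: int, alpha: float = 1.0) -> float:
    """Distance-weighted loss: -sum_i |i - y|^alpha * log(1 - p_i).

    probs[i] is the predicted probability of score i (here 0..5).
    Assumed formulation; not confirmed as the one used in the paper.
    """
    eps = 1e-12  # guards against log(0) when a probability equals 1
    return -sum(
        (abs(i - true_score) ** alpha) * math.log(max(1.0 - p, eps))
        for i, p in enumerate(probs)
    )


# A near miss (predicting 4 when the true score is 3) costs less than
# a far miss (predicting 0 when the true score is 3).
near_miss = ordinal_log_loss([0.0, 0.0, 0.0, 0.1, 0.9, 0.0], true_score=3)
far_miss = ordinal_log_loss([0.9, 0.0, 0.0, 0.1, 0.0, 0.0], true_score=3)
```

Unlike plain cross-entropy, this kind of loss treats the 0 to 5 scores as ordered, so errors of several points are punished more heavily than off-by-one errors.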
