Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

Han Zhang; Jiamin Su; Li Liu

Abstract

Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
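The decision-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token ids, the scalar `gate` (the abstract describes an adaptive gated fusion module, simplified here to a fixed value), the convex-combination form of the fusion, the expected-absolute-distance penalty, and the weight `lam` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_logits(vocab_logits, score_token_ids):
    """Select the LM-head logits at the predefined score tokens.

    `score_token_ids` is ordered from the lowest to the highest score, so
    position k in the result corresponds to the k-th score on the scale.
    """
    return [vocab_logits[t] for t in score_token_ids]


def gated_fusion(text_logits, mm_logits, gate):
    """Assumed fusion form: a convex combination of two score-logit vectors.

    `gate` lies in [0, 1]; 0 keeps only the textual logits and 1 keeps only
    the multimodal logits. A learned gate would produce this value per input.
    """
    return [gate * m + (1.0 - gate) * t for t, m in zip(text_logits, mm_logits)]


def distance_aware_loss(logits, gold_index, lam=0.5):
    """Cross-entropy plus an assumed expected-absolute-distance penalty."""
    probs = softmax(logits)
    ce = -math.log(probs[gold_index])
    expected_dist = sum(p * abs(k - gold_index) for k, p in enumerate(probs))
    return ce + lam * expected_dist


# Hypothetical vocabulary logits; ids 101..104 stand in for scores 0..3.
vocab_logits = {101: 0.2, 102: 1.5, 103: 3.0, 104: 0.1, 999: 5.0}
score_ids = [101, 102, 103, 104]

text = score_logits(vocab_logits, score_ids)
multimodal = [0.0, 2.5, 2.0, 0.3]  # made-up logits from a multimodal pass
fused = gated_fusion(text, multimodal, gate=0.3)

predicted = max(range(len(fused)), key=fused.__getitem__)
print("predicted score index:", predicted)
print("loss if gold index is 2:", distance_aware_loss(fused, 2))
```

Because the decision is taken over the score tokens only, a high logit on an unrelated token (id 999 above) cannot affect the predicted score, which is the contrast with decoding and parsing a generated string.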

Paper Structure (47 sections, 10 equations, 8 figures, 11 tables)

Introduction
Related Work
Automated Essay Scoring
LLM-Based Essay Scoring
Ordinal Modeling for Scoring Tasks
Positioning.
Multimodal Essay Scoring
Methodology
Problem Formulation
Decision-Level Ordinal Modeling
DLOM-GF for Multimodal AES
Training Objective
Experiments
Datasets and Evaluation Metrics
Multimodal Dataset.
...and 32 more sections

Figures (8)

Figure 1: Comparison between generation-based scoring and decision-level ordinal modeling (DLOM).
Figure 2: Overview of the proposed decision-level ordinal modeling framework. The framework consists of three stages: (i) supervised fine-tuning (SFT) for semantic encoding, (ii) decision-level score-logit extraction over an ordered score-token set, and (iii) task-specific decision-level objectives: decision-level gated fusion for multimodal scoring and distance-aware regularization for text-only scoring.
Figure 3: Prompt-wise QWK trends across different models on ASAP/ASAP++. Each point corresponds to a specific trait under a given prompt. Vertical dashed lines indicate boundaries between prompts.
Figure 4: Prompt Template for Multimodal Dataset
Figure 5: Prompt Template for Text-only Dataset
...and 3 more figures
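
The Figure 3 caption reports QWK, which here is taken to mean quadratic weighted kappa, the agreement metric commonly used in essay scoring. The sketch below shows the standard definition in plain Python. The function name and the integer score range arguments are illustrative choices, not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two distinct score levels")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed confusion matrix: rows are gold scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            # Expected count under independence of the two marginals.
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # If both raters use a single identical score, disagreement is undefined;
    # this sketch returns 1.0 in that case.
    if denominator == 0.0:
        return 1.0
    return 1.0 - numerator / denominator


gold = [0, 1, 2, 3, 2, 1]
pred = [0, 1, 2, 2, 2, 1]
print("QWK:", quadratic_weighted_kappa(gold, pred, min_score=0, max_score=3))
```

The quadratic weight penalizes a prediction two levels away four times as heavily as one a single level away, which is why the metric suits ordered rating scales.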
