Japanese-English Sentence Translation Exercises Dataset for Automatic Grading

Naoki Miura; Hiroaki Funayama; Seiya Kikuchi; Yuichiroh Matsubayashi; Yuya Iwase; Kentaro Inui

TL;DR

This work introduces a novel task: automatic grading of Sentence Translation Exercises (STEs) by predicting analytic scores for rubric criteria that reflect educators’ learning objectives. It builds the first STE dataset for Japanese–English, with 21 questions and 3,498 responses annotated with per-criterion scores and justification cues, and reports substantial inter-annotator agreement. Two baseline approaches are evaluated: a finetuned BERT model that uses justification cues as supervision, and GPT-3.5/GPT-4 with in-context learning. Finetuned BERT reaches approximately 90% $F_1$ on correct responses but less than 80% on incorrect responses, and the GPT models perform worse than BERT, indicating that the task remains challenging for current LLMs. The results support the feasibility of rubric-based automatic STE grading and provide a foundation for future work that may integrate GEC and machine translation to improve feedback and robustness, while acknowledging limitations in data representativeness and scalability. The dataset and findings have practical implications for automated formative assessment in L2 learning and point toward broader language pairs and richer feedback generation.

Abstract

This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), which are used in the early stages of L2 language learning. We formalize the task as grading student responses against each rubric criterion pre-specified by educators. We then create a Japanese-English STE dataset comprising 21 questions and a total of 3,498 student responses (167 per question on average). The responses were collected from students and crowd workers. Using this dataset, we evaluate baselines including a finetuned BERT model and GPT models with few-shot in-context learning. Experimental results show that the finetuned BERT baseline classifies correct responses with an F1 of approximately 90%, but achieves less than 80% for incorrect responses. Furthermore, the GPT models with few-shot learning perform worse than finetuned BERT, indicating that the proposed task remains challenging even for state-of-the-art large language models.

Paper Structure (35 sections, 7 equations, 3 figures, 5 tables)

Introduction
Automatic scoring of sentence translation exercises
Sentence translation exercises
Task formulation
Analytic score prediction:
Sentence translation exercise (STEs) dataset
Collecting student responses
Annotation:
Justification cue:
Annotation quality:
Statistics of data:
Method
Finetuned BERT model
Architecture:
Training:
...and 20 more sections

Figures (3)

Figure 1: Example of sentence translation exercise. We excerpted the analytic criteria "E3," "O4," and "G4" from Q11 in our dataset. The correct answer is "I had never seen a koala before I saw one in Australia two years ago." "Chunk" denotes a Japanese phrasal unit. "E," "O," and "G" are categories of each analytic criterion, which stand for "expression," "word order," and "grammar," respectively.
Figure 2: Input for the GPT models
Figure 3: The performance of the GPT-3.5 model when changing the number of in-context examples. The x-axis represents the number of in-context examples. The y-axis represents the averaged $F_1$-score among all analytic criteria.
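
The averaged $F_1$-score mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a simplified binary label per analytic criterion (1 = correct, 0 = incorrect) and an unweighted mean over criteria; the paper's exact score scale and averaging protocol are not stated in this excerpt. The function names and the toy labels are hypothetical, and only the criterion IDs "E3," "O4," and "G4" come from Figure 1.

```python
from typing import Dict, List


def f1_for_class(gold: List[int], pred: List[int], positive: int) -> float:
    """F1 for one label value, treating `positive` as the target class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(
    gold_by_criterion: Dict[str, List[int]],
    pred_by_criterion: Dict[str, List[int]],
    positive: int,
) -> float:
    """Unweighted mean of per-criterion F1 (an assumed averaging scheme)."""
    scores = [
        f1_for_class(gold_by_criterion[c], pred_by_criterion[c], positive)
        for c in gold_by_criterion
    ]
    return sum(scores) / len(scores)


# Toy labels (hypothetical, not taken from the dataset).
gold = {"E3": [1, 1, 0, 1], "O4": [1, 0, 0, 1], "G4": [0, 1, 1, 1]}
pred = {"E3": [1, 1, 0, 0], "O4": [1, 0, 1, 1], "G4": [0, 1, 1, 1]}

f1_correct = averaged_f1(gold, pred, positive=1)    # F1 on "correct" labels
f1_incorrect = averaged_f1(gold, pred, positive=0)  # F1 on "incorrect" labels
print(f"correct: {f1_correct:.3f}, incorrect: {f1_incorrect:.3f}")
```

Reporting the two classes separately mirrors how the abstract distinguishes performance on correct responses from performance on incorrect ones.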
