Constrained C-Test Generation via Mixed-Integer Programming
Ji-Ung Lee, Marc E. Pfetsch, Iryna Gurevych
TL;DR
This paper introduces a constrained C-Test generation framework that uses mixed-integer programming (MIP) to jointly optimize gap size and placement. By coupling a gap-difficulty predictor $f_{\theta}$ with binary decision variables for gap placement $b_i$ and gap size $s_{i,j}$, the approach minimizes $|\tau - \hat{\tau}|$, where $\tau$ is the target difficulty and $\hat{\tau}$ aggregates the predicted gap difficulties; this yields globally optimal C-Tests that satisfy hard constraints. An evidence-based gap-difficulty model (primarily XGBoost with features from Beinborn (2016) plus BERT-derived features) enables end-to-end optimization and robust performance across texts. A user study with 40 participants shows that MIP outperforms GPT-4 and the placement-based baseline SEL, performs on par with the size-based baseline SIZE, and correlates best with perceived difficulty. The work provides a practical, provably correct tool for educators to generate tailored C-Tests and highlights avenues for modeling inter-gap dependencies and improving the control of LLMs in educational settings. The authors also publish code, data, and models to foster reproducibility and further research in constrained language exercise generation.
Abstract
This work proposes a novel method to generate C-Tests, a variant of cloze tests (gap-filling exercises) in which only the last part of a word is turned into a gap. In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach. This allows us to consider gap size and placement simultaneously, achieving globally optimal solutions, and to directly integrate state-of-the-art models for gap difficulty prediction into the optimization problem. A user study with 40 participants across four C-Test generation strategies (including GPT-4) shows that our approach (MIP) significantly outperforms two of the baseline strategies (based on gap placement and GPT-4) and performs on par with the third (based on gap size). Our analysis shows that GPT-4 still struggles to fulfill explicit constraints during generation and that MIP produces C-Tests that correlate best with the perceived difficulty. We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open-source license.
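
To make the joint choice of gap placement and gap size concrete, the following is a minimal sketch on a toy instance. It is not the authors' implementation and does not call a MIP solver: it enumerates every placement and size combination and keeps the one whose aggregated predicted difficulty is closest to the target. The function name `generate_c_test`, the table `TOY_DIFFICULTY`, and the use of the mean as the aggregate are assumptions made for illustration only; in the paper, the difficulties come from a learned predictor rather than a fixed table.

```python
from itertools import combinations, product

# Hypothetical predicted difficulties (e.g. error rates) for illustration only.
# TOY_DIFFICULTY[i][j] is the assumed difficulty of word i with gap-size option j.
TOY_DIFFICULTY = [
    [0.1, 0.3],
    [0.2, 0.5],
    [0.4, 0.8],
    [0.3, 0.6],
]


def generate_c_test(difficulty, num_gaps, target):
    """Exhaustively search gap placements and sizes on a small instance.

    Hard constraint: exactly `num_gaps` words receive a gap, each with
    exactly one size option. Objective: minimise |target - tau_hat|,
    where tau_hat is assumed here to be the mean predicted difficulty.
    Returns (objective, placement, sizes, tau_hat).
    """
    best = None
    # Placement: which words receive a gap.
    for placement in combinations(range(len(difficulty)), num_gaps):
        # Size: one size option per selected word.
        size_options = [range(len(difficulty[i])) for i in placement]
        for sizes in product(*size_options):
            total = sum(difficulty[i][j] for i, j in zip(placement, sizes))
            tau_hat = total / num_gaps
            objective = abs(target - tau_hat)
            if best is None or objective < best[0]:
                best = (objective, placement, sizes, tau_hat)
    return best


objective, placement, sizes, tau_hat = generate_c_test(TOY_DIFFICULTY, 2, 0.5)
print(placement, sizes, tau_hat, objective)
```

Enumeration grows exponentially with the number of words, which is why a solver-based formulation is attractive. In a MIP, the absolute value in the objective is typically linearized by minimizing an auxiliary variable $z$ subject to $z \geq \tau - \hat{\tau}$ and $z \geq \hat{\tau} - \tau$.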
