
Constrained C-Test Generation via Mixed-Integer Programming

Ji-Ung Lee, Marc E. Pfetsch, Iryna Gurevych

TL;DR

This paper introduces a constrained C-Test generation framework using mixed-integer programming (MIP) to jointly optimize gap size and placement. By coupling a gap-difficulty predictor $f_{\theta}$ with binary decision variables for gap placement $b_i$ and gap size $s_{i,j}$, the approach minimizes $|\tau - \hat{\tau}|$, where $\tau$ is the target difficulty and $\hat{\tau}$ aggregates predicted gap difficulties; this yields globally optimal C-Tests that satisfy hard constraints. An evidence-based gap-difficulty model (primarily XGBoost with Beinborn2016 features plus BERT-derived features) enables end-to-end optimization and robust performance across texts; a user study with 40 participants shows MIP outperforms GPT-4 and SEL and matches SIZE, while correlating best with perceived difficulty. The work provides a practical, provably correct tool for educators to generate tailored C-Tests and highlights avenues for modeling inter-gap dependencies and improving educational LLM control. The authors also publish code, data, and models to foster reproducibility and further research in constrained language exercise generation.

Abstract

This work proposes a novel method to generate C-Tests, a variant of cloze tests (gap-filling exercises) in which only the last part of a word is turned into a gap. In contrast to previous works, which vary either the gap size or the gap placement and thus achieve only locally optimal solutions, we propose a mixed-integer programming (MIP) approach. This allows us to consider gap size and placement simultaneously, achieving globally optimal solutions, and to directly integrate state-of-the-art models for gap difficulty prediction into the optimization problem. A user study with 40 participants across four C-Test generation strategies (including GPT-4) shows that our approach (MIP) significantly outperforms two of the baseline strategies (based on gap placement and GPT-4) and performs on par with the third (based on gap size). Our analysis shows that GPT-4 still struggles to fulfill explicit constraints during generation and that MIP produces C-Tests that correlate best with the perceived difficulty. We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
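To make the joint optimization concrete, the following is a minimal sketch that searches over gap placement and gap size at the same time and keeps the combination whose mean predicted difficulty is closest to a target. It is not the authors' MIP formulation: it uses exhaustive enumeration instead of a MIP solver, and `toy_gap_difficulty` is a hypothetical stand-in for the learned predictor. The function names, the toy sentence, and the rule that at least one letter of each word must remain are assumptions made for illustration.

```python
from itertools import combinations, product


def toy_gap_difficulty(word: str, gap_size: int) -> float:
    """Hypothetical predictor: the larger the removed share, the harder the gap."""
    return gap_size / len(word)


def best_c_test(words, num_gaps, target):
    """Enumerate every placement and size; return (error, placement, sizes)."""
    best = None
    # Only words with at least two letters can be partially gapped.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    for placement in combinations(candidates, num_gaps):
        # Hard constraint (assumed): keep at least one letter of each word.
        size_ranges = [range(1, len(words[i])) for i in placement]
        for sizes in product(*size_ranges):
            predicted = sum(
                toy_gap_difficulty(words[i], s) for i, s in zip(placement, sizes)
            ) / num_gaps
            error = abs(target - predicted)
            if best is None or error < best[0]:
                best = (error, placement, sizes)
    return best


def render(words, placement, sizes):
    """Replace the last `s` letters of each selected word with underscores."""
    out = list(words)
    for i, s in zip(placement, sizes):
        out[i] = words[i][:-s] + "_" * s
    return " ".join(out)


words = "the cat sat on the mat".split()
error, placement, sizes = best_c_test(words, num_gaps=3, target=0.5)
print(render(words, placement, sizes), error)
```

Enumeration is only feasible for toy inputs because the number of combinations grows exponentially with text length; that is the motivation for handing the same decision space to a MIP solver, which can also certify global optimality.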

Paper Structure (74 sections, 11 equations, 22 figures, 19 tables)


Figures (22)

  • Figure 1: A simplified C-Test generation example. Colors indicate the gap sizes and words considered during generation. While SIZE (red) only varies the gap size with a static placement (every second word) and SEL (blue) only varies the placement with a static gap size (the second half of a word, rounded up), MIP (purple) considers all possible combinations. In contrast to MIP, purely neural approaches (GPT-4) provide no theoretical guarantee that all constraints are always satisfied. In this example, the word "on" is fully turned into a gap although the model correctly states in its response that words are only "partially deleted" in C-Tests (cf. Figure 3 for the full prompt and response).
  • Figure 2: MIP vs. SIZE. Colored squares indicate the gap error rates on a color scale from 0.0 (green) to 1.0 (red).
  • Figure 3: Prompt (top) and response (bottom) of GPT-4 (OpenAI, 2023) for the request to turn a short sentence into a C-Test. The word "on" is fully turned into a gap, showing that the model fails to follow all generation constraints for C-Tests.
  • Figure 4: Transformer-based model setups.
  • Figure 5: Absolute differences between the predicted and the true error rate ($\Delta$ gap error rate), sorted by gap size.
  • ...and 17 more figures
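
The static scheme mentioned in the Figure 1 caption (every second word, with the second half of the word, rounded up, turned into a gap) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and starting at the second word, skipping one-letter words, and ignoring punctuation are assumptions not stated in the captions.

```python
import math


def gap_word(word: str) -> str:
    """Replace the second half of a word (rounded up) with underscores."""
    gap_size = math.ceil(len(word) / 2)
    return word[: len(word) - gap_size] + "_" * gap_size


def static_c_test(sentence: str) -> str:
    """Gap every second word, starting at the second word (assumption)."""
    words = sentence.split()
    return " ".join(
        # One-letter words are left intact so no word disappears entirely.
        gap_word(w) if i % 2 == 1 and len(w) > 1 else w
        for i, w in enumerate(words)
    )


print(static_c_test("the cat sat on the mat"))
# prints: the c__ sat o_ the m__
```

Under this rule the word "on" becomes "o_" rather than disappearing entirely, which is the constraint that the GPT-4 response in Figure 3 violates.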