Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens

Hanna Hubarava, Yingqiang Gao

Abstract

Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect how well the control targets are met. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models toward target readability levels and compression rates. Across three model families at multiple sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures of target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
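The control-token setup described in the abstract can be illustrated with a short sketch: a prompt builder that prepends discrete readability and compression tokens to the source, plus an FKGL implementation for deriving the readability target. The token names (`<FKGL_k>`, `<CR_r>`) and the syllable heuristic are illustrative assumptions, not the paper's exact format.

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; sufficient for illustration only.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

def build_instruction(source: str, fkgl_target: int, cr_target: float) -> str:
    # Hypothetical discrete control tokens; the actual token inventory may differ.
    return (f"<FKGL_{fkgl_target}> <CR_{cr_target:.1f}> "
            f"Simplify the following text:\n{source}")
```

During fine-tuning, the tokens would be computed from each (source, reference) pair, e.g. `fkgl(reference)` rounded to the nearest grade and the reference-to-source length ratio binned to a discrete compression rate.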

Paper Structure

This paper contains 49 sections, 15 figures, and 9 tables.

Figures (15)

  • Figure 1: Mean control-attribute values across models and datasets. Fine-tuned models tend to generate text that is simpler than the source but more complex than the target. Sentence-aligned datasets (Med-EASi, SimPA, WikiLarge) show very little compression signal (around 1.0).
  • Figure 2: Both SimPA and Newsela show larger distribution shifts by readability than by length transformations. Text-aligned Newsela shows greater distribution shift than the sentence-aligned SimPA.
  • Figure 3: Blue: model-native automatic prompt formatting. Green: system prompt. Purple: source text with its control attribute value. Plum: target (reference) control attribute value.
  • Figure 4: WikiLarge. Global sampling with stratification by readability and length, measured in terms of KS, EMD and JSD. Stratification by $N$ chars shows smallest divergence across all metrics.
  • Figure 6: Dataset: SimPA. Control attribute: FKGL. Scaling does not always boost performance. However, we observe strong positive correlation between SARI and COMET, and a strong negative correlation between SARI/COMET and error-based metrics across most datasets and control attributes.
  • ...and 10 more figures
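The split-quality diagnostics named in the Figure 4 caption (KS, EMD, JSD) can be sketched with standard SciPy routines. In this sketch the inputs are assumed to be per-example values of one control attribute (e.g. character counts) from the train and test portions of a split; the function name and binning are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon

def split_divergence(train, test, bins=20):
    """Divergence between train/test marginals of one control attribute."""
    train, test = np.asarray(train, float), np.asarray(test, float)
    ks = ks_2samp(train, test).statistic          # Kolmogorov-Smirnov statistic
    emd = wasserstein_distance(train, test)       # earth mover's distance
    lo = min(train.min(), test.min())
    hi = max(train.max(), test.max())
    # Shared bin edges so the two histograms are comparable;
    # scipy's jensenshannon normalizes the counts internally.
    p, _ = np.histogram(train, bins=bins, range=(lo, hi))
    q, _ = np.histogram(test, bins=bins, range=(lo, hi))
    jsd = jensenshannon(p, q, base=2) ** 2        # JS divergence in bits
    return ks, emd, jsd
```

A well-stratified split should drive all three values toward zero; large values flag exactly the distributional mismatch that naive splits can introduce.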