Policy-based Sentence Simplification: Replacing Parallel Corpora with LLM-as-a-Judge

Xuanxin Wu, Yuki Arase, Masaaki Nagata

TL;DR

This work introduces a policy-driven framework for sentence simplification that uses an LLM-as-a-Judge to generate policy-aligned preference data, eliminating the need for costly parallel corpora. By training with Adaptive Rejection Preference Optimization (ARPO) on data produced from multiple LLMs, the approach yields policy-aligned outputs for two edit policies: lexical-paraphrasing and overall-rewriting. The method enables small open-source LLMs to surpass GPT-4o on lexical tasks and reach comparable performance on overall rewriting, with strong human agreement and robust out-of-domain transfer. This offers a scalable, controllable pathway for tailoring text simplification to diverse audiences and applications, including education and accessibility tools.

Abstract

Sentence simplification aims to modify a sentence to make it easier to read and understand while preserving the meaning. Different applications require distinct simplification policies, such as replacing only complex words at the lexical level or rewriting the entire sentence while trading off details for simplicity. However, achieving such policy-driven control remains an open challenge. In this work, we introduce a simple yet powerful approach that leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for costly human annotation or parallel corpora. Our method enables building simplification systems that adapt to diverse simplification policies. Remarkably, even small-scale open-source LLMs such as Phi-3-mini-3.8B surpass GPT-4o on lexical-oriented simplification, while achieving comparable performance on overall rewriting, as verified by both automatic metrics and human evaluations. The consistent improvements across model families and sizes demonstrate the robustness of our approach.

Paper Structure

This paper contains 28 sections, 3 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overview of our framework. We collect simplifications from four LLMs: Qwen2.5-7B (Qwen, 2025), Llama3.1-8B (Grattafiori et al., 2024), Phi4-14B (Abdin et al., 2024), and Qwen3-32B (Yang et al., 2025). Based on the guidelines (++/--: high reward/penalty, +/-: moderate reward/penalty), the reasoning judge LLM evaluates each simplification along three dimensions: lexical, structural, and overall. Depending on the edit policy, we use either the lexical preference (for lexical-paraphrasing) or the overall preference (for overall-rewriting) to train LLMs.
  • Figure 2: Automatic evaluation results. The higher the better.
  • Figure 3: Impact of training sample size. Colored lines show models trained on our preference data; colored triangles show models trained on 2k human-written parallel data. The triangles overlap because Qwen14B and Llama8B obtain nearly identical LENS scores.
  • Figure 4: SARI scores on ASSET. The higher the better.
  • Figure 5: Prompts used for sentence-level simplification generation (from Wu and Arase, 2025).
  • ...and 11 more figures
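
The data construction outlined in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes each candidate simplification already carries numeric judge scores for the three dimensions (lexical, structural, overall) and that a preference pair is formed by taking the highest- and lowest-scored candidates on the dimension tied to the edit policy. The pairing rule, the score values, the example sentences, and every name below (`Candidate`, `POLICY_TO_DIMENSION`, `build_preference_pair`) are hypothetical and chosen for illustration only; the paper's actual selection procedure and the ARPO training step are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed mapping: each edit policy reads one judge dimension (per Figure 1).
POLICY_TO_DIMENSION = {
    "lexical-paraphrasing": "lexical",
    "overall-rewriting": "overall",
}


@dataclass
class Candidate:
    model: str                # which LLM produced the simplification
    text: str                 # the simplified sentence
    scores: Dict[str, float]  # judge scores keyed by dimension


def build_preference_pair(
    source: str, candidates: List[Candidate], policy: str
) -> Optional[Dict[str, str]]:
    """Return a prompt/chosen/rejected triple, or None if the judge sees a tie."""
    dim = POLICY_TO_DIMENSION[policy]
    ranked = sorted(candidates, key=lambda c: c.scores[dim], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.scores[dim] == worst.scores[dim]:
        return None  # no preference signal on this dimension
    return {"prompt": source, "chosen": best.text, "rejected": worst.text}


# Illustrative data only: scores are made up, not taken from the paper.
source = "The physician administered the medication to the patient."
candidates = [
    Candidate("model_a", "The doctor gave the medicine to the patient.",
              {"lexical": 2, "structural": 0, "overall": 1}),
    Candidate("model_b", "The patient got medicine.",
              {"lexical": 0, "structural": 1, "overall": 2}),
    Candidate("model_c", "The physician administered the medication.",
              {"lexical": -1, "structural": 0, "overall": -1}),
]

lexical_pair = build_preference_pair(source, candidates, "lexical-paraphrasing")
overall_pair = build_preference_pair(source, candidates, "overall-rewriting")
```

With these made-up scores, the same candidate pool yields a different chosen output for each policy, which is the point of scoring the dimensions separately: one round of generation and judging can serve both policies.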