From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation

Ali Malik; Stephen Mayhew; Chris Piech; Klinton Bicknell

TL;DR

The paper tackles the practical problem of controlling the language proficiency level of LLM-generated content for language learners by formalizing the Proficiency Control Task (PCT) and evaluating multiple strategies. It combines prompt-based methods, finetuning of open-source models, and reinforcement learning with PPO, along with a simple yet powerful top-$k$ sampling boost. A key contribution is CaLM (CEFR-Aligned Language Model), which, via finetuning plus PPO, matches GPT-4 performance at a fraction of the cost, and even dominates prompting-based GPT-4 in Pareto terms when combined with top-$k$ sampling. The work provides a synthetic TinyTolkien dataset and a broader methodological framework, with human evaluations confirming high quality and alignment to target proficiency, making the approach practical for education-focused content generation and tool-building. Overall, it demonstrates that open-source models can achieve GPT-4-level proficiency control at substantially lower compute, enabling scalable, CEFR-aligned content generation for learners.

Abstract

We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open-source alternatives like Llama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open-source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CaLM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study.

Paper Structure (54 sections, 1 equation, 6 figures, 5 tables)

Introduction
Prompt-based approaches
Finetuning open source models
Boosting PCT models through sampling
Background: CEFR
Automatic CEFR Scoring
The Proficiency Control Task
Control
Quality
Cost
Strategies for Proficiency Control
Prompt-based approaches
Baseline
Describing CEFR
Few-shot Learning
...and 39 more sections

Figures (6)

Figure 1: (top) GPT-4 generates content at a native proficiency level. (bottom) Results from our CaLM proficiency control model for different target levels.
Figure 2: Distribution of different readability metrics for each CEFR level in the generated TinyTolkien data.
Figure 3: Tradeoff between relative cost (in FLOPs) and $\mathit{ControlError}$ for different strategies. Each base point represents a different strategy, and additional points per colour show results for top-$k$ sampling with that strategy. Increasing $k$ reduces the error of any strategy by paying a higher cost. The solid lines represent the theoretical trade-off (estimated using bootstrap sampling) in cost vs $\mathit{ControlError}$ as $k$ is increased for each strategy.
Figure 4: Predicted CEFR scores correspond to human perception of difficulty. As the difference in predicted proficiency scores between story A and story B increases, humans are better able to identify the more challenging story. The yellow dots (y = 0) correspond to instances where the evaluator rated story A as more challenging and the blue dots (y = 1) correspond to when they rated story B as more challenging.
Figure 5: Different CEFR datasets introduce distribution shift. Pearson correlation coefficient of predictions made by a CEFR scorer trained on a particular dataset and evaluated on another. Performance drops off the diagonal due to distribution shift.
...and 1 more figure
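
The Figure 3 caption says that increasing $k$ lowers $\mathit{ControlError}$ at a higher cost. The sketch below illustrates one plausible reading of that trade-off, not the authors' implementation. It assumes that top-$k$ sampling means drawing $k$ candidate generations and keeping the one whose scored level is closest to the target, and that $\mathit{ControlError}$ is the squared distance between the scored level and the target; neither definition appears on this page. The names `control_error`, `best_of_k`, and `mean_error` are hypothetical, and the generator and scorer are toy stand-ins (a "text" is just a number and its score is itself).

```python
import random
from typing import Callable, Tuple


def control_error(score: float, target: float) -> float:
    """Squared distance between a scored level and the target (assumed form)."""
    return (score - target) ** 2


def best_of_k(
    generate: Callable[[], float],
    score: Callable[[float], float],
    target: float,
    k: int,
) -> Tuple[float, float]:
    """Draw k candidates and return (best_candidate, its_control_error)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate() for _ in range(k)]
    scored = [(control_error(score(c), target), c) for c in candidates]
    err, best = min(scored, key=lambda pair: pair[0])
    return best, err


def mean_error(k: int, target: float, trials: int, seed: int) -> float:
    """Average control error of best-of-k over many trials with a toy model."""
    rng = random.Random(seed)

    def generate() -> float:
        # Toy generator: output levels scatter around the target.
        return rng.gauss(target, 1.0)

    def score(text: float) -> float:
        # Toy scorer: the "text" is already its own level.
        return text

    total = sum(best_of_k(generate, score, target, k)[1] for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        # Generation cost grows linearly with k: k candidates per kept output.
        print(k, round(mean_error(k, target=3.0, trials=2000, seed=0), 4))
```

Under these assumptions the average error shrinks as $k$ grows, while the number of generations (and so the cost) scales with $k$, which is the shape of the trade-off the caption describes.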