Controlled Diversity: Length-optimized Natural Language Generation

Diana Marie Schenke, Timo Baumann

TL;DR

This paper tackles the problem of generating text under strict length constraints by augmenting training data with explicit length targets and fine-tuning LLMs with supervised fine-tuning (SFT) and variants of reinforcement learning from human feedback (RLHF). It systematically compares SFT, PPO, DPO, and ORPO on two data pipelines and shows that ORPO consistently improves adherence to length requirements with minimal quality loss, while PPO and some DPO configurations can be unstable or less effective. A simple data-augmentation strategy, embedding the length requirement in the prompt, enables effective training and can be paired with either real or model-generated responses; the results suggest that ORPO with data augmentation offers a practical path to length-controlled NLG. The work highlights the trade-off between maintaining output quality and hitting an exact length, discusses limits on generalization to short lengths, and points to future work on broader tasks and datasets.

Abstract

LLMs are not generally able to adjust the length of their outputs based on strict length requirements, a capability that would improve their usefulness in applications that require adherence to diverse user and system requirements. We present an approach to train LLMs to acquire this capability by augmenting existing data and applying existing fine-tuning techniques, which we compare based on the trained models' adherence to the length requirement and overall response quality relative to the baseline model. Our results demonstrate that these techniques can be successfully applied to train LLMs to adhere to length requirements, with the trained models generating texts which better align to the length requirements. Our results indicate that our method may change the response quality when using training data that was not generated by the baseline model. This allows simultaneous alignment to another training objective in certain scenarios, but is undesirable otherwise. Training on a dataset containing the model's own responses eliminates this issue.
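The augmentation idea described above, attaching an explicit length target to an existing prompt/response pair, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the instruction wording, and the choice of words and characters as units are all assumptions made for the example.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def augment_with_length(prompt: str, response: str, unit: str = "words") -> dict:
    """Derive a length target from an existing response and embed it in the prompt.

    The instruction template below is hypothetical; the paper's exact
    phrasing and supported units are not given in this summary.
    """
    if unit == "words":
        target = count_words(response)
    elif unit == "characters":
        target = len(response)
    else:
        raise ValueError(f"unsupported unit: {unit!r}")

    augmented_prompt = f"{prompt}\n\nAnswer in exactly {target} {unit}."
    return {
        "prompt": augmented_prompt,
        "response": response,
        "target": target,
        "unit": unit,
    }


example = augment_with_length("Describe the sea.", "The sea is wide and blue.")
print(example["prompt"])
```

Because the target is read off the response itself, the same routine applies whether the response comes from an existing dataset or was generated by the baseline model, which are the two data sources the abstract contrasts.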

Paper Structure

This paper contains 6 sections, 7 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Process Diagram for the PPO, DPO and ORPO approaches to Reinforcement Learning with Human Feedback. All approaches require a pretrained LLM and a preference dataset, typically human-made.
  • Figure 2: Distribution of length requirements in our test data set ($n=1280$ samples). Each of the four portions of the test data is of equal size.
  • Figure 3: A schematic overview of our training process, described in detail in the Methods section. The most successful models in each step are highlighted in green.
  • Figure 4: Distribution of percentage deviation from the length target for our two final models compared to the unoptimized baseline across four different types of length requirement. Note the different y-axes in the plots.
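The percentage deviation from the length target plotted in Figure 4 could be computed along these lines. This is a sketch under assumptions: the signed form (negative for outputs that are too short) and the function name are illustrative choices, since the exact definition is not given here.

```python
def percent_deviation(actual: int, target: int) -> float:
    """Signed deviation of the actual length from the target, in percent.

    Assumption: positive values mean the output is too long, negative
    values mean it is too short. The paper may use a different convention.
    """
    if target <= 0:
        raise ValueError("target length must be positive")
    return 100.0 * (actual - target) / target


# A 110-word answer to a 100-word requirement overshoots by 10 percent.
print(percent_deviation(110, 100))
# A 45-word answer to a 50-word requirement undershoots by 10 percent.
print(percent_deviation(45, 50))
```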