Multi-Objective Linguistic Control of Large Language Models

Dang Nguyen, Jiuhai Chen, Tianyi Zhou

TL;DR

This work tackles the challenge of controlling multiple linguistic complexities in large language models. It introduces Multi-Control Tuning (MCTune), a simple, data-efficient approach that appends a vector of linguistic complexity features to instruction inputs and trains models to generate responses conditioned on selected controls. By finetuning LLaMA2-7B on existing instruction-tuning data (Alpaca-GPT4 and WizardLM) with randomized subsets of controls, MCTune achieves substantial improvements in controllability across multiple features while maintaining or even improving generation quality, as evidenced by MT-Bench and pairwise evaluations. The method leverages off-the-shelf data, a Gaussian-based sampling strategy for evaluation, and a sigma hyperparameter to adjust difficulty, demonstrating practical, scalable multi-objective linguistic control with broad implications for personalized and adaptive AI systems.
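The Gaussian-based sampling strategy with a sigma difficulty knob mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's code: the feature names and their (mean, std) statistics are invented placeholders, and only the idea of drawing targets from a Gaussian widened by $\sigma$ reflects the described method.

```python
import random

# Hypothetical sketch of the Gaussian-based sampling strategy for
# evaluation targets. FEATURE_STATS holds illustrative (mean, std)
# pairs for each linguistic complexity feature; the real features and
# statistics come from the paper's datasets, not from this snippet.
FEATURE_STATS = {
    "avg_word_length": (4.8, 0.6),
    "avg_sentence_length": (17.0, 5.0),
    "type_token_ratio": (0.55, 0.1),
}

def sample_targets(features, sigma=1.0, rng=random):
    """Draw a target value for each control from N(mean, (sigma*std)^2).

    A larger sigma pushes targets further from typical values,
    making the control task harder (the paper's difficulty knob).
    """
    targets = {}
    for name in features:
        mean, std = FEATURE_STATS[name]
        targets[name] = rng.gauss(mean, sigma * std)
    return targets

targets = sample_targets(["avg_word_length", "type_token_ratio"], sigma=0.5)
```

With a small sigma, sampled targets stay close to corpus-typical values; increasing sigma produces the progressively harder test settings evaluated in the paper.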

Abstract

Large language models (LLMs), despite their breakthroughs on many challenging benchmark tasks, tend to generate verbose responses and lack controllability of output complexity, which human users usually prefer in practice. In this paper, we study how to precisely control multiple linguistic complexities of LLM output by finetuning using off-the-shelf data. To this end, we propose multi-control tuning (MCTune), which includes multiple linguistic complexity values of ground-truth responses as controls in the input for instruction tuning. We finetune LLaMA2-7B on the Alpaca-GPT4 and WizardLM datasets. Evaluations on widely used benchmarks demonstrate that our method not only improves LLMs' multi-complexity controllability substantially but also retains or even enhances the quality of the responses as a side benefit.
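The core idea of including complexity values as controls in the input can be sketched as a simple prompt-formatting step. The template below is an assumption for illustration only; the paper's actual format (shown in its Figure 1) additionally places a full list of feature descriptions in a system prompt.

```python
# Hypothetical sketch of MCTune-style input formatting: append the
# linguistic complexity values of the ground-truth response as a
# control vector to the instruction. The template string and feature
# names are illustrative assumptions, not the paper's exact format.
def format_with_controls(instruction: str, controls: dict[str, float]) -> str:
    control_str = ", ".join(
        f"{name} = {value:.2f}" for name, value in controls.items()
    )
    return (
        f"{instruction}\n\n"
        f"Generate a response whose linguistic complexity satisfies: {control_str}"
    )

prompt = format_with_controls(
    "Explain photosynthesis.",
    {"avg_sentence_length": 12.00, "type_token_ratio": 0.60},
)
```

During training, the control values are measured from the ground-truth response, so the model learns to condition its generation on them; at inference time, a user supplies desired values instead.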

Paper Structure

This paper contains 38 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An example of how data is formatted before being fed into LLMs in this paper. The first paragraph presents a system prompt containing a complete list of feature descriptions. Our preliminary results indicate that including descriptions enhances the effectiveness of our approach.
  • Figure 2: Comparison of controllability error (the average normalized $L_1$ error) of ChatGPT (gpt-3.5-turbo-0125), IFT (finetuning without controls), and MCTune (ours) on the Alpaca-GPT4 and WizardLM datasets across three test settings of target linguistic complexity with increasing $\sigma$ (difficulty). To visualize each linguistic feature's average error on the same radar plot, we apply min-max normalization to bring them to a similar scale. To reduce the effect of outliers, the minimum $L_1$ error of the $i$-th feature is the minimum among all baselines, and the maximum $L_1$ error is the 95th percentile of errors among all baselines.
  • Figure 3: Trade-off between linguistic controllability and generation quality in three increasingly difficult test settings. Each dot represents a model's response $\hat{\mathbf{y}}$ to a specific query $\left[\mathbf{x}, f_C\right]$. The response is given a quality score from 1 to 10 by a judge LLM (GPT-4 Turbo) based on how well $\hat{\mathbf{y}}$ addresses $\mathbf{x}$. A controllability error is measured for $\hat{\mathbf{y}}$ by taking the average of normalized $L_1$ errors across all linguistic controls in $f_C$. Blue and orange dots respectively represent responses from models trained by IFT and MCTune on the Alpaca-GPT4 dataset.
  • Figure 4: Comparison between MCTune and IFT-trained models on MT-Bench. We finetune LLaMA2-7B on the Alpaca-GPT4 dataset, with GPT-4 Turbo as the judge. The average score per axis ranges from 1 to 10 and is given by the judge.
  • Figure 5: Linguistic controllability error on the test set as the number of linguistic controls $n$ per sample in MCTune's training increases. The solid curves represent the linguistic controllability error averaged over all linguistic complexities (lower is better), with shaded areas representing the 95% confidence interval. The dotted vertical line indicates the maximum number of controls (5) used during MCTune's training.
  • ...and 2 more figures
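The controllability error used in Figures 2 and 3 — the average of min-max-normalized $L_1$ errors over the controlled features — can be sketched as below. The per-feature normalization bounds (the minimum error among baselines and the 95th percentile of errors, per the Figure 2 caption) are supplied by the caller; the numbers in the usage example are illustrative only.

```python
# Sketch of the controllability error from Figures 2-3: average the
# min-max-normalized L1 errors between target and measured feature
# values. err_min[name] is the smallest L1 error for that feature among
# all baselines; err_max[name] is the 95th percentile of its errors.
def controllability_error(targets, measured, err_min, err_max):
    norm_errors = []
    for name, target in targets.items():
        err = abs(measured[name] - target)
        lo, hi = err_min[name], err_max[name]
        norm_errors.append((err - lo) / (hi - lo))
    return sum(norm_errors) / len(norm_errors)

# Illustrative usage with made-up feature values and bounds.
score = controllability_error(
    targets={"avg_sentence_length": 10.0, "type_token_ratio": 0.5},
    measured={"avg_sentence_length": 12.0, "type_token_ratio": 0.4},
    err_min={"avg_sentence_length": 0.0, "type_token_ratio": 0.0},
    err_max={"avg_sentence_length": 4.0, "type_token_ratio": 0.2},
)
```

Clamping the normalized errors to $[0, 1]$ before averaging would be a natural extension when an error exceeds the 95th-percentile bound, though whether the paper does so is not stated in the captions.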