Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

Yi-Lin Tuan, Xilun Chen, Eric Michael Smith, Louis Martin, Soumya Batra, Asli Celikyilmaz, William Yang Wang, Daniel M. Bikel

TL;DR

Balancing safety and helpfulness in LLMs is critical for user experience. The authors introduce a controllable framework that uses control tokens, self-generated data, and data distillation (MOEC) to tune LLM outputs toward desired safety/helpfulness levels without extra annotations. They compare training-free baselines and fine-tuning objectives (CLM, ExMATE, RLHF) and find ExMATE generally provides the strongest overall control, though disentanglement between attributes remains challenging. The results show self-generated data can rewind an aligned model to unlock controllability in a cost-effective way, enabling scenario-aware adjustments for diverse applications.

Abstract

As large language models (LLMs) become widely accessible, the trade-off between safety and helpfulness can significantly impact user experience. A model that prioritizes safety can leave users feeling less engaged and less assisted, while one that prioritizes helpfulness can cause harm. Possible harms include teaching people how to build a bomb, exposing youth to inappropriate content, and hurting users' mental health. In this work, we propose to balance safety and helpfulness in diverse use cases by controlling both attributes in LLMs. We explore training-free and fine-tuning methods that do not require extra human annotations, and we analyze the challenges of controlling safety and helpfulness in LLMs. Our experiments demonstrate that our method can rewind a learned model and unlock its controllability.
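To make the control-token idea from the TL;DR concrete, the following is a minimal sketch of how a self-generated response might be labeled with the helpfulness and safety levels that reward models assign to it. Everything here is an assumption for illustration: the `[helpful=…][safe=…]` token format, the score range of [0, 1], the five discrete levels, and the names `Example`, `bucket`, `control_prefix`, and `to_training_text` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str
    helpful_score: float  # assumed to lie in [0, 1], e.g. from a helpfulness reward model
    safe_score: float     # assumed to lie in [0, 1], e.g. from a safety reward model


def bucket(score: float, num_levels: int = 5) -> int:
    """Map a score in [0, 1] to an integer level in [0, num_levels - 1]."""
    clipped = min(max(score, 0.0), 1.0)
    return min(int(clipped * num_levels), num_levels - 1)


def control_prefix(helpful_level: int, safe_level: int) -> str:
    """Build a hypothetical control-token prefix (the format is an assumption)."""
    return f"[helpful={helpful_level}][safe={safe_level}]"


def to_training_text(ex: Example, num_levels: int = 5) -> str:
    """Prefix a self-generated response with the levels its own scores fall into."""
    prefix = control_prefix(
        bucket(ex.helpful_score, num_levels),
        bucket(ex.safe_score, num_levels),
    )
    return f"{prefix} {ex.prompt}\n{ex.response}"
```

Under these assumptions, no human annotation is needed: the labels come from scoring the model's own outputs, and at inference time a caller would choose the prefix to request a desired balance of the two attributes.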

Paper Structure (34 sections, 8 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Given the same input, we expect a model to generate safer or more helpful responses depending on the situation.
  • Figure 2: A pretrained LLM may already have undergone (a) supervised fine-tuning (SFT) and (b) RLHF; (c) our paradigm then enables the model's controllability through (1) self-generation, which reuses the training data $X$ and the reward models ($RM_{hp}$ and $RM_{sf}$), and (2) data distillation, which denoises the data and prevents backdoors.
  • Figure 3: Our proposed fine-tuning methods for controlling LLMs, based on ExMATE (Tuan et al., 2022) or RLHF (Ouyang et al., 2022).
  • Figure 4: The score distribution of our synthetic MOEC data. The helpful but unsafe responses are rare.
  • Figure 5: The posterior distribution of the scores of generated responses given the input control. Reranking is less helpful for controlling responses, since it yields no examples in certain cases (the upper-left corner in (a) is blank).
  • ...and 2 more figures