Prompt-Based Value Steering of Large Language Models
Giulio Antonio Abbo, Tony Belpaeme
TL;DR
This work addresses the problem of aligning LLM outputs with human values in dynamic contexts by proposing a model-agnostic, prompt-only evaluation framework. It combines Schwartz's basic human values with a value detector (ValueNet_DeBERTa_v3) and a dialogue dataset to quantify how well a prompt candidate steers generated text toward target values, without altering model parameters. The scoring framework tracks gains, retentions, losses, and neutrals across values, normalizing to a per-value score $\hat{S}_v$ and an overall score $S$, enabling iterative prompt improvements. Case studies on Wizard-Vicuna-13B-Uncensored show that explicitly conditioning prompts on values can substantially increase value-aligned outputs (e.g., $S$ rising from $0.57$ to $0.83$), illustrating the practical potential of prompt-based value steering while highlighting limitations and directions for future work.
Abstract
Large language models are increasingly used in applications where alignment with human values is critical. While model fine-tuning is often employed to ensure safe responses, this technique is static and does not lend itself to everyday situations involving dynamic values and preferences. In this paper, we present a practical, reproducible, and model-agnostic procedure to evaluate whether a prompt candidate can effectively steer generated text toward specific human values, formalising a scoring method to quantify the presence and gain of target values in generated responses. We apply our method to a variant of the Wizard-Vicuna language model, using Schwartz's theory of basic human values and a structured evaluation through a dialogue dataset. With this setup, we compare a baseline prompt to one explicitly conditioned on values, and show that value steering is possible even without altering the model or dynamically optimising prompts.
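To make the idea of scoring the presence and gain of target values concrete, the following is a minimal sketch of the bookkeeping such a scoring method could rest on. It is not the paper's implementation. It assumes that a value detector has already labelled, for each dialogue, which values appear in the baseline response and in the response generated with the prompt candidate. The function names `categorize` and `tally`, the example labels, and the exact definitions of gain, retention, loss, and neutral are illustrative assumptions; the normalisation into per-value and overall scores is deliberately left out because it is not specified here.

```python
from collections import Counter


def categorize(before: bool, after: bool) -> str:
    """Classify one (dialogue, value) pair.

    Assumed definitions (not taken from the paper):
      gain      - value absent in the baseline, present with the candidate prompt
      retention - value present in both
      loss      - value present in the baseline, absent with the candidate prompt
      neutral   - value absent in both
    """
    if after and not before:
        return "gain"
    if after and before:
        return "retention"
    if before and not after:
        return "loss"
    return "neutral"


def tally(baseline, candidate):
    """Count the four categories per value.

    baseline, candidate: lists of sets of detected value names,
    aligned so that index i refers to the same dialogue in both.
    """
    values = sorted(set().union(*baseline, *candidate))
    counts = {v: Counter() for v in values}
    for b, c in zip(baseline, candidate):
        for v in values:
            counts[v][categorize(v in b, v in c)] += 1
    return counts


# Hypothetical detector output for three dialogues.
baseline = [{"benevolence"}, set(), {"security"}]
candidate = [{"benevolence"}, {"benevolence"}, set()]

counts = tally(baseline, candidate)
for value, c in counts.items():
    print(value, dict(c))
```

In this toy example, benevolence is retained in one dialogue and gained in another, while security is lost once. A per-value score would then be derived from these four counts.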
