Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets

Ike Obi, Rohan Pant, Srishti Shekhar Agrawal, Maham Ghazanfar, Aaron Basiletti

TL;DR

We introduce Value Imprint, a framework for auditing and classifying the human values embedded within RLHF datasets, and find that information-utility values are the most dominant human values in all three RLHF datasets audited.

Abstract

LLMs are increasingly fine-tuned using RLHF datasets to align them with human preferences and values. However, very limited research has investigated which specific human values are operationalized through these datasets. In this paper, we introduce Value Imprint, a framework for auditing and classifying the human values embedded within RLHF datasets. To investigate the viability of this framework, we conducted three case study experiments by auditing the Anthropic/hh-rlhf, OpenAI WebGPT Comparisons, and Alpaca GPT-4-LLM datasets to examine the human values embedded within them. Our analysis involved a two-phase process. During the first phase, we developed a taxonomy of human values through an integrated review of prior works from philosophy, axiology, and ethics. Then, we applied this taxonomy to annotate 6,501 RLHF preferences. During the second phase, we employed the labels generated from the annotation as ground truth data for training a transformer-based machine learning model to audit and classify the three RLHF datasets. Through this approach, we discovered that information-utility values, including Wisdom/Knowledge and Information Seeking, were the most dominant human values within all three RLHF datasets. In contrast, prosocial and democratic values, including Well-being, Justice, and Human/Animal Rights, were the least represented human values. These findings have significant implications for developing language models that align with societal values and norms. We contribute our datasets to support further research in this area.
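
The comparison step described above (classify each preference, then compare how often each value appears per dataset) can be illustrated with a minimal sketch. It assumes the classifier's predicted labels are already available as plain lists of strings; the function name `value_distribution`, the dataset keys, and the toy labels are hypothetical placeholders for illustration, not the paper's code or results.

```python
from collections import Counter


def value_distribution(labels):
    """Return the share of each value label in a list of predicted labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical classifier outputs: made-up labels, NOT results from the paper.
toy_predictions = {
    "dataset_a": ["Wisdom/Knowledge", "Information Seeking",
                  "Wisdom/Knowledge", "Well-being"],
    "dataset_b": ["Information Seeking", "Wisdom/Knowledge",
                  "Justice", "Information Seeking"],
}

# One normalized distribution per dataset, so datasets of different
# sizes can be compared on the same scale.
distributions = {
    name: value_distribution(labels)
    for name, labels in toy_predictions.items()
}

for name, dist in distributions.items():
    # Rank values from most to least frequent within each dataset.
    ranked = sorted(dist.items(), key=lambda item: item[1], reverse=True)
    print(name, ranked)
```

Normalizing to shares rather than raw counts is one reasonable design choice here, since the audited datasets differ in size; a per-dataset table of such shares is the kind of input a comparison heatmap could be drawn from.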

Paper Structure

This paper contains 92 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Value Imprint is a technique for auditing the human values embedded within RLHF datasets using an AI-focused human values taxonomy.
  • Figure 2: A visual version of the taxonomy that supported our audit. [See Table 1 and the Appendix for the complete description and citation of the human values taxonomy.]
  • Figure 3: This heatmap compares the human values embedded within the three RLHF datasets, showing that all three datasets were oriented more toward information-utility values and less toward prosocial values.