Aligning Large Language Models with Counterfactual DPO

Bradley Butcher

TL;DR

This work introduces counterfactual prompting within the Direct Preference Optimization framework to align large language models with desired stylistic behaviors without relying on human-annotated data. By defining control, treatment, and negative prompts, and deploying variants such as Counterfactual ENC/DIS, Contrastive DPO, and Instruction Negation, the authors demonstrate improved control over bias, hallucinations, and instruction adherence. Across proof-of-concept and practical experiments on a 7B instruction-tuned model, the approach reduces unwanted behavior (e.g., naming in summaries, pirate slang, and biased responses) while preserving core reasoning capabilities, indicating a scalable path for self-supervised alignment in open-source contexts. The results suggest that Contrastive DPO offers robust performance, with potential for iterative, multi-style embedding and adherence to evolving regulatory standards for responsible AI deployment.

Abstract

Advancements in large language models (LLMs) have demonstrated remarkable capabilities across a diverse range of applications. These models excel in generating text completions that are contextually coherent and cover an extensive array of subjects. However, the vast datasets required for their training make aligning response styles during the pretraining and instruction tuning phases challenging. Consequently, an additional alignment phase is typically employed, wherein the model is further trained with human preference data to better align its outputs with human expectations. While this process doesn't introduce new capabilities per se, it does accentuate generation styles innate to the model. This paper explores the utilization of counterfactual prompting within the framework of Direct Preference Optimization (DPO) to align the model's style without relying on human intervention. We demonstrate that this method effectively instils desirable behaviour, mitigates undesirable behaviour, and encourages the model to disregard inappropriate instructions. Our findings suggest that counterfactual prompting with DPO presents a low-resource way to fine-tune LLMs to meet the demands for responsible and ethically aligned AI systems.

Paper Structure (18 sections, 7 equations, 1 figure, 3 tables)

Introduction
Background
Related Work
Method
Experiments
Proof of Concept Experiments
Entity Redactor
Highly Critical Summariser
Practical Results
Reducing Bias in LLM Responses
Reducing Hallucination
Ignoring Instructions
Pirates are banned
Entity Redactor Revisited
Discussion
...and 3 more sections

Figures (1)

Figure 1: An illustration of DPO and Counterfactual DPO (ENC). On the left, standard DPO: human preference information is used to fine-tune the LLM policy via maximum likelihood. On the right, Counterfactual DPO: desired style information is used to generate treatment and control prompts and responses, and the treatment response is assumed to be the preferred generation. DPO then proceeds as normal, assuming the control prompt was used to generate both responses. Diagrams of the other method configurations are included in the supplementary material.
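The data flow described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `PreferencePair`, `build_counterfactual_pair`, and `dpo_loss` are hypothetical, the treatment prompt is assumed to be formed by prepending a style instruction to the control prompt, and `generate` stands in for whatever model call produces a completion. The loss is the standard per-example DPO objective computed from sequence log-probabilities, with `beta` as an illustrative default.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # the prompt DPO is trained on (the control prompt)
    chosen: str    # response treated as preferred
    rejected: str  # response treated as dispreferred


def build_counterfactual_pair(
    control_prompt: str,
    style_instruction: str,
    generate: Callable[[str], str],
) -> PreferencePair:
    """Build one preference pair without human labels (ENC-style sketch).

    Assumption: the treatment prompt is the control prompt with the
    style instruction prepended. The treatment response is taken as
    preferred, and both responses are attributed to the control prompt.
    """
    treatment_prompt = f"{style_instruction}\n\n{control_prompt}"
    chosen = generate(treatment_prompt)   # produced with the style instruction
    rejected = generate(control_prompt)   # produced without it
    return PreferencePair(prompt=control_prompt, chosen=chosen, rejected=rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(margin)).

    All log-probabilities are for full responses conditioned on
    the same (control) prompt.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Because the stored prompt is the control prompt, a model trained on such pairs is pushed toward the styled response even when the style instruction is absent, which is the effect the caption attributes to Counterfactual DPO.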
