ContraSolver: Self-Alignment of Language Models by Resolving Internal Preference Contradictions

Xu Zhang, Xunjian Yin, Xiaojun Wan

TL;DR

This work addresses the problem of internal preference contradictions in large language models and introduces ContraSolver, a self-alignment method that constructs and globally regularizes a weighted preference graph over candidate responses. By initializing with a maximum spanning tree and employing reverse and forward loops, ContraSolver identifies low-confidence contradictory edges and integrates topological edges to produce a DAG, enabling unsupervised, globally consistent alignment data for Direct Preference Optimization. Empirical results across four generation tasks show improved performance and a marked reduction in internal contradictions, supporting the claim that resolving preference contradictions enhances both internal consistency and downstream generation quality. The approach offers a data-efficient, model-driven pathway to alignment that reduces dependence on human labels and iterative tuning, with potential broader impact on robust and safe LLM deployment.

Abstract

While substantial advancements have been made in developing large language models (LLMs), achieving control over their behavior can be difficult. Direct preference optimization (DPO) assumes the existence of a latent reward function to evaluate the responses of LLMs. This assumption implies a strict preference ordering over different responses to the same input. However, our experimental observations show that the preferences of LLMs consistently contain contradictions. In this paper, we construct a graph structure of the preference relationship among different responses with self-annotation to find contradictions in the preference order. We propose ContraSolver, an algorithm that traverses all edges on the preference graph to identify those that might cause contradictions. ContraSolver initializes the graph with a maximum spanning tree and identifies contradictory edges, prioritizing the resolution of low-confidence preferences while preserving high-confidence ones. Experimental results on four different generation tasks show that the performance of different LLMs can be substantially improved through our fully unsupervised self-alignment. Furthermore, by analyzing the preference graphs of LLMs with and without self-alignment by ContraSolver, we quantify the reduction in contradictions, suggesting that resolving preference contradictions is crucial for achieving better alignment performance.
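
To make the notion of a preference contradiction concrete, the following is a minimal sketch, not the authors' implementation. It assumes (hypothetically) that self-annotated preferences are stored as a dictionary mapping a `(preferred, rejected)` pair to a confidence weight. A latent reward function would order the responses strictly, so the directed preference graph would contain no cycle; any directed cycle is therefore a contradiction. The helper names `find_cycle` and `weakest_edge_on_cycle` are illustrative only.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if acyclic.

    `edges` maps (preferred, rejected) -> confidence (assumed data layout).
    """
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def dfs(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge: the cycle is the tail of the current path.
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle is not None:
                    return cycle
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle is not None:
                return cycle
    return None


def weakest_edge_on_cycle(edges, cycle):
    """Return the lowest-confidence edge among the edges forming `cycle`."""
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return min(pairs, key=lambda edge: edges[edge])


# Toy example: A > B and B > C with high confidence, but C > A with low confidence.
preferences = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "A"): 0.55}
cycle = find_cycle(preferences)
print(cycle)                                      # ['A', 'B', 'C']
print(weakest_edge_on_cycle(preferences, cycle))  # ('C', 'A')
```

In this toy case the low-confidence edge `('C', 'A')` is the natural candidate to treat as contradictory, which matches the abstract's stated priority of resolving low-confidence preferences while preserving high-confidence ones.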

Paper Structure

This paper contains 37 sections, 4 theorems, 6 equations, 3 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

For any contradictory edge $(y_i, y_j) \in E_c$ identified by ContraSolver, the weight $w(y_i, y_j)$ is always lower than the weights of the heuristic edges $E_h$ added to the graph $\mathcal{G}'$ to resolve the contradiction.

Figures (3)

  • Figure 1: Detailed illustration of how ContraSolver traverses the preference graph. The graph is initialized with a maximum spanning tree. In the reverse loop, ContraSolver finds all edges that contradict existing edges. In the forward loop, the algorithm skips non-topological edges and adds the first topological edge to the graph. Finally, we obtain the heuristic edges (shown in bold) for training.
  • Figure 2: An illustration of data construction. The construction of data can be divided into three steps: diverse generation, preference graph construction and preference data selection.
  • Figure 3: GPT-4 evaluation results on Instruction Following and Summarization. We report the win rate of each method against ContraSolver, as judged by GPT-4.
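
The traversal described in Figure 1 can be approximated with a much simpler greedy procedure. The sketch below is a stand-in under stated assumptions, not the paper's reverse and forward loops: it initializes with a maximum spanning tree (Kruskal's algorithm, ignoring edge direction), then visits the remaining edges from highest to lowest confidence, keeping an edge only if it does not close a directed cycle. The data layout (a dictionary from `(preferred, rejected)` to confidence, one directed edge per pair) and the names `max_spanning_tree`, `reaches`, and `split_edges` are hypothetical.

```python
def max_spanning_tree(edges):
    """Kruskal's algorithm, highest weight first, ignoring edge direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = {}
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree[(u, v)] = w
    return tree


def reaches(kept, src, dst):
    """True if `dst` is reachable from `src` along directed edges in `kept`."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for (u, v) in kept if u == node)
    return False


def split_edges(edges):
    """Split edges into a cycle-free kept set and a contradictory set."""
    kept = max_spanning_tree(edges)  # a tree cannot contain a cycle
    contradictory = {}
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        if (u, v) in kept:
            continue
        if reaches(kept, v, u):  # adding u -> v would close a directed cycle
            contradictory[(u, v)] = w
        else:
            kept[(u, v)] = w
    return kept, contradictory


preferences = {
    ("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "A"): 0.55,
    ("A", "D"): 0.7, ("D", "C"): 0.6,
}
kept, contradictory = split_edges(preferences)
print(sorted(kept))    # [('A', 'B'), ('A', 'D'), ('B', 'C'), ('D', 'C')]
print(contradictory)   # {('C', 'A'): 0.55}
```

Because edges are only kept when they do not close a cycle, the kept set is a DAG by construction. In this toy example the single contradictory edge also has the lowest weight, but this greedy variant is not claimed to provide the guarantee stated in Theorem 1.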

Theorems & Definitions (7)

  • Theorem 1: Local Optimality
  • Lemma 1
  • Proof A.1
  • Lemma 2
  • Proof A.2
  • Theorem 2: Global Consistency
  • Proof A.3