Self-supervised Attribute-aware Dynamic Preference Ranking Alignment

Hongyu Yang, Qi Zhao, Zhenhua Hu, Rui Li

TL;DR

SeAdpra addresses the challenge of aligning LLMs to multifactor user preferences in community question answering without costly human labels. It introduces Attribute-Perceptual Distance Factors (APDF) and a Multi-APDF Matrix to quantify cross-attribute preference gaps, coupled with self-supervised dynamic ranking, perceptual alignment, and iterative perceptual comparison to achieve fine-grained, list-wise alignment. The framework is validated on StaCoCoQA, a large-scale programming CoQA dataset, and eight public CoQA domains, using the CSTC-compliant PrefHit/PrefRecall metrics and standard semantic metrics, where SeAdpra consistently outperforms baselines including PRO across models and domains. Security verification on PKU-SafeRLHF demonstrates that improving preference alignment can co-occur with enhanced safety, evidenced by higher SaferHit and reduced toxicity. Limitations include predefined attributes and potential trade-offs between preference optimization and general generation quality, guiding future work toward broader attributes and factual-coherence evaluation.

Abstract

Reinforcement Learning from Human Feedback and its variants excel in aligning with human intentions to generate helpful, harmless, and honest responses. However, most of them rely on costly human-annotated pairwise comparisons for supervised alignment, which is not suitable for list-level scenarios, such as community question answering. Additionally, human preferences are influenced by multiple intrinsic factors in responses, leading to decision-making inconsistencies. Therefore, we propose Self-supervised Attribute-aware dynamic preference ranking, called SeAdpra. It quantifies preference differences between responses based on Attribute-Perceptual Distance Factors (APDF) and dynamically determines the list-wise alignment order. Furthermore, it achieves fine-grained preference difference learning and enables precise alignment with the optimal one. We specifically constructed a challenging code preference dataset named StaCoCoQA, and introduced more cost-effective and scalable preference evaluation metrics: PrefHit and PrefRecall. Extensive experimental results show that SeAdpra exhibits superior performance and generalizability on both StaCoCoQA and preference datasets from eight popular domains.

Paper Structure

This paper contains 42 sections, 26 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Which response should the LLMs align with? In the code community, each response has different attributes such as semantics, popularity, and timeliness, leading to potentially different optimal responses.
  • Figure 2: Showcasing the top-15 primary programming language categories in StaCoCoQA.
  • Figure 3: The overall framework of SeAdpra, which includes: (Part 1) Multi-attribute Perception for quantifying preference, containing the Construction of Multi-APDF Matrix and Self-supervised dynamic ranking; (Part 2) Perceptual Alignment for aligning the optimal ranks objective; (Part 3) Perceptual Comparison on all candidates for learning on-chain preference difference.
  • Figure 4: Implementation Workflow of Perceptual Comparison. In each round, the reward of the current positive is maximized, and the penalty for the remaining negative is minimized sequentially.
  • Figure 5: The performance with Confidence Interval (CI) of our SeAdpra and PRO at different data scales.
  • ...and 8 more figures