Uncovering Factor Level Preferences to Improve Human-Model Alignment

Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh

TL;DR

PROFILE is a factor-level analysis framework for aligning LLM outputs with human preferences. It decomposes preference into interpretable factors in both generation and discrimination settings. It defines a factor taxonomy and computes factor influence scores $\tau(f)$, enabling cross-task comparisons across summarization, instruction-following, and document-based QA. The study finds that LLMs tend to over-prioritize length during generation, while discrimination tasks show stronger factor-level alignment, exposing a generation–discrimination gap. The authors further demonstrate that using an LLM as an evaluator, via self-evaluation fine-tuning and feedback-driven generation, can improve generation alignment, offering practical pathways toward more human-aligned LLMs.
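
As a rough illustration of what a factor influence score $\tau(f)$ could look like, the sketch below computes a Kendall-style agreement score between overall pairwise preferences and pairwise differences in how strongly a factor appears. This is an assumption for illustration only: the summary does not give the paper's exact formula, and the function name `factor_influence`, the `+1`/`-1`/`0` encoding, and the tie handling are all hypothetical.

```python
from typing import Iterable, Tuple


def factor_influence(pairs: Iterable[Tuple[int, int]]) -> float:
    """Kendall-style factor score (illustrative; not the paper's exact definition).

    Each pair is (preferred, factor_diff):
      preferred   = +1 if response A is preferred overall, -1 if response B is
      factor_diff = +1 if A shows the factor more, -1 if B does, 0 if tied
    """
    concordant = discordant = ties = 0
    for preferred, factor_diff in pairs:
        if factor_diff == 0:
            ties += 1          # factor does not distinguish the two responses
        elif preferred == factor_diff:
            concordant += 1    # preferred response shows more of the factor
        else:
            discordant += 1    # preferred response shows less of the factor
    total = concordant + discordant + ties
    # Assumed convention: ties count in the denominator, empty input scores 0.
    return (concordant - discordant) / total if total else 0.0


# Hypothetical data for a 'length' factor over four response pairs.
length_pairs = [(1, 1), (1, 1), (-1, 1), (1, 0)]
score = factor_influence(length_pairs)
```

Under this convention, a score near 1 would mean the preferred response almost always exhibits more of the factor, a score near -1 the opposite, and a score near 0 little systematic influence.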

Abstract

Large language models (LLMs) often exhibit tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. While crucial for improvement, identifying the factors driving these misalignments remains challenging due to existing evaluation methods' reliance on coarse-grained comparisons and lack of explainability. To address this, we introduce PROFILE, an automated framework to uncover and measure factor-level preference alignment of humans and LLMs. Using PROFILE, we analyze preference alignment across three key tasks: summarization, instruction-following, and document-based QA. We find a significant discrepancy: while LLMs show poor factor-level alignment with human preferences when generating texts, they demonstrate strong alignment in discrimination tasks. We demonstrate how leveraging the identified generation-discrimination gap can be used to improve LLM alignment through multiple approaches, including fine-tuning with self-guidance. Our work highlights the value of factor-level analysis for identifying hidden misalignments and provides a practical framework for improving LLM-human preference alignment.

Paper Structure

This paper contains 77 sections, 7 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: PROFILE uncovers that models exhibit misalignments with human preferences when generating texts. While humans prioritize different quality factors for different tasks, models show consistent bias towards longer output.
  • Figure 2: An overview of the PROFILE pipeline: (1) Extracting overall Response-level Preference, (2) Comparing factor manifestation in a pairwise manner, (3) Quantifying Factor Influence, and (4) Comparing human and model preference at the factor level.
  • Figure 3: PROFILE uncovers the factor-level preferences of humans and models. The figure compares factor-level preference alignment between humans, GPT-4o, and Gemini-1.5 in generation across three tasks: (a) Summarization, (b) Instruction-following, and (c) Document QA. The left bar graphs display factor scores ($\tau_{14}$) for selected factors. The right tables show the rankings of all factors for each task. Notably, both models consistently rank 'length' as the top factor across tasks, while human preferences vary by task.
  • Figure 4: Pearson correlation between target conditioning scores and log probabilities of generated summaries for Mistral-7b and LLaMA-3.1-70b.
  • Figure 5: A screenshot of a sample summary with preference annotations.
  • ...and 1 more figure