
Offline Clustering of Preference Learning with Active-data Augmentation

Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili, Carlee Joe-Wong

TL;DR

This work addresses offline clustering of heterogeneous user preferences from fixed pairwise feedback, a setting where data are imbalanced and online interaction is limited. It introduces Off-C$^2$PL, which learns cluster structures and aggregates data without coverage assumptions, and extends it to A$^2$-Off-C$^2$PL, which actively collects data along underrepresented dimensions to improve estimation. The authors derive suboptimality bounds that balance noise from aggregation against bias from heterogeneity, and report substantial empirical gains on synthetic data and the Reddit TL;DR dataset. By connecting clustering, offline data, and active sampling within a contextual logistic-bandit/BTL framework, the paper provides a principled pathway to personalized preference learning under realistic data constraints, with practical implications for RLHF and recommender systems.

Abstract

Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially when offline data is imbalanced across dimensions, and (2) handling imbalanced offline data in which some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of imbalanced data, we extend our framework to the setting with active-data augmentation, where the learner is allowed to select a limited number of additional active data points for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
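
For orientation, pairwise feedback of this kind is commonly modeled with the Bradley-Terry-Luce (BTL) model named in the TL;DR. The standard linear form is sketched below. The symbols $\theta_u$ (preference vector of user $u$), $\phi$ (feature map), $x$ (context), and $a, a'$ (the two compared responses) are illustrative assumptions and may differ from the paper's own notation.

```latex
\[
\Pr\bigl(a \succ a' \mid x\bigr)
  = \sigma\!\bigl(\theta_u^\top \bigl(\phi(x,a) - \phi(x,a')\bigr)\bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}} .
\]
```

Under this assumed form, users whose vectors $\theta_u$ are close behave similarly on every comparison, which is what makes pooling their offline data plausible; pooling users with different $\theta_u$ reduces noise but introduces bias.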


Paper Structure

This paper contains 31 sections, 38 equations, 2 figures, 3 tables, 2 algorithms.

Figures (2)

  • Figure 1: Illustration of how active-data augmentation enhances performance by increasing the minimum eigenvalue of the information matrix. Left: Pure offline data suffers from underrepresented dimensions, limiting performance. Middle: Adding random offline samples offers limited improvement. Right: Actively selected samples focus on underrepresented dimensions, substantially increasing the minimum eigenvalue and improving performance.
  • Figure 2: Results on the synthetic and Reddit datasets, with one panel per dataset in each group: performance in the offline setting with insufficient data; performance in the hybrid setting; the impact of dimension $d$; and the impact of the clustering threshold $\hat{\gamma}$.
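
The eigenvalue intuition in Figure 1 can be illustrated with a small numerical sketch. This is not the paper's A$^2$-Off-C$^2$PL algorithm: it assumes the information matrix is a ridge-regularized Gram matrix of feature vectors, and the dimension, sample counts, feature scales, and helper names (`min_eig`, `add_samples`, `active_directions`) are all hypothetical choices made for illustration.

```python
import numpy as np


def min_eig(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M)[0])


def add_samples(M, Z):
    """Add the rank-one contributions z z^T of every row z of Z."""
    return M + Z.T @ Z


def active_directions(M, n):
    """Greedily pick n unit-norm vectors, each along the eigenvector
    of the current matrix with the smallest eigenvalue."""
    M = M.copy()
    picks = []
    for _ in range(n):
        _, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
        z = vecs[:, 0]                # least-informative direction
        picks.append(z)
        M = M + np.outer(z, z)
    return np.array(picks)


rng = np.random.default_rng(0)
d = 5
# Assumed imbalance: the last dimension is almost absent from the data.
scales = np.array([1.0, 1.0, 1.0, 1.0, 0.05])

X_off = rng.normal(size=(200, d)) * scales
M_off = X_off.T @ X_off + 1e-3 * np.eye(d)   # small ridge term (assumption)

n_new = 20
Z_rand = rng.normal(size=(n_new, d)) * scales   # more of the same imbalanced data
Z_act = active_directions(M_off, n_new)         # targeted unit-norm samples

lam_off = min_eig(M_off)
lam_rand = min_eig(add_samples(M_off, Z_rand))
lam_act = min_eig(add_samples(M_off, Z_act))

print(f"offline only : {lam_off:.3f}")
print(f"+ random     : {lam_rand:.3f}")
print(f"+ active     : {lam_act:.3f}")
```

In this toy setup the random samples come from the same imbalanced distribution, so they barely move the smallest eigenvalue, while each targeted sample adds its full weight to the weakest direction. Greedy selection along the smallest eigenvector is only one simple way to realize that idea.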