
Contextual Preference Distribution Learning

Benjamin Hudson, Laurent Charlin, Emma Frejinger

Abstract

Decision-making problems often feature uncertainty stemming from heterogeneous and context-dependent human preferences. To address this, we propose a sequential learning-and-optimization pipeline to learn preference distributions and leverage them to solve downstream problems, for example risk-averse formulations. We focus on human choice settings that can be formulated as (integer) linear programs. In such settings, existing inverse optimization and choice modelling methods infer preferences from observed choices but typically produce point estimates or fail to capture contextual shifts, making them unsuitable for risk-averse decision-making. Using a bounded-variance score function gradient estimator, we train a predictive model mapping contextual features to a rich class of parameterizable distributions. This approach yields a maximum likelihood estimate. The model generates scenarios for unseen contexts in the subsequent optimization phase. In a synthetic ridesharing environment, our approach reduces average post-decision surprise by up to 114$\times$ compared to a risk-neutral approach with perfect predictions and up to 25$\times$ compared to leading risk-averse baselines.
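As a rough illustration of the score function gradient estimator mentioned above, the following sketch estimates the gradient of an expected cost with respect to the mean of an isotropic Gaussian. This is the generic textbook estimator with a simple mean baseline, not the bounded-variance estimator proposed in the paper; the function name `score_function_gradient`, the Gaussian family, and the toy objective are all assumptions made for illustration.

```python
import numpy as np

def score_function_gradient(mu, sigma, f, n_samples=10_000, seed=0):
    """Estimate d/dmu E_{c ~ N(mu, sigma^2 I)}[f(c)] by Monte Carlo.

    Uses the identity grad E[f(c)] = E[f(c) * grad log p(c)], which only
    needs evaluations of f, not its derivatives. This matters when f is
    the output of a (non-differentiable) solver.
    """
    rng = np.random.default_rng(seed)
    # Draw samples from the current distribution (hypothetical Gaussian family).
    c = rng.normal(mu, sigma, size=(n_samples, mu.shape[0]))
    values = f(c)                 # shape (n_samples,)
    # Subtracting the sample mean is a common variance-reduction baseline;
    # it introduces a small O(1/n_samples) bias.
    baseline = values.mean()
    # Score of N(c; mu, sigma^2 I) with respect to mu.
    score = (c - mu) / sigma**2
    return ((values - baseline)[:, None] * score).mean(axis=0)

# Toy objective f(c) = sum(c^2), whose exact gradient w.r.t. mu is 2 * mu.
mu = np.array([1.0, -2.0])
grad = score_function_gradient(mu, sigma=0.5, f=lambda c: (c**2).sum(axis=1))
```

In this toy setting the estimate can be compared against the closed-form gradient `2 * mu`; in a learning pipeline, `mu` would instead be the output of a predictive model evaluated on contextual features.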

Paper Structure

This paper contains 29 sections, 14 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our sequential learning-and-optimization approach. (a) During training, we learn a model mapping contextual features to preference (objective function coefficient) distributions in the choice problem (an ILP). (b) At inference, we leverage the model to generate scenarios in a risk-averse estimate-then-optimize problem, allowing us to minimize risk introduced by uncertain human preferences.
  • Figure 2: Model performance across dataset scale and fidelity. Test loss (left) and post-decision surprise (right) as a function of training set size (rows) and observation fidelity (columns), for nine random seeds. Increasing the number of instances ($|\mathcal{I}|$) helps the model learn the context-to-distribution mapping, while increasing the number of samples per instance ($|\mathcal{S}|$) reduces noise in the choice distribution statistics ($\bar{\boldsymbol{\phi}}$).
  • Figure 3: Post-decision surprise (top) and disappointment (bottom) as a function of training set size (rows) and observation fidelity (columns), aggregated over nine random seeds. While surprise penalizes early and late arrivals, disappointment only penalizes late arrivals. The combination of high surprise and low disappointment suggests overly conservative assignments.
  • Figure 4: Test loss as a function of training set size (rows) and observation fidelity (columns), aggregated over nine random seeds.
  • Figure 5: Recovery of the ground-truth preference distribution location (top) and scale (bottom) up to an affine transformation as a function of training set size (rows) and observation fidelity (columns), aggregated over nine random seeds. The maximum score is one and the minimum score is zero. Only REINFORCE and CPDL are able to learn parameters beyond the location.
  • ...and 1 more figure