Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts

Xianwei Cao, Dou Quan, Zhenliang Zhang, Shuang Wang

Abstract

Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study sequential decision-making in which these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interactions, and conditions its policy on the inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.
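
To make the idea of a belief over preference weights concrete, the following is a minimal sketch and not the authors' implementation. It assumes the belief is a diagonal Gaussian over logits that a softmax maps to weights on the simplex, and that vector-valued returns are combined by a weighted sum; the names `sample_preference` and `scalarize` and all numeric values are hypothetical.

```python
import numpy as np

def sample_preference(mu, log_sigma, rng):
    """Draw preference weights from a Gaussian belief over logits (assumed form)."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps      # reparameterized sample of the logits
    z = z - z.max()                       # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()                    # softmax: non-negative, sums to 1

def scalarize(vector_return, weights):
    """Weighted-sum scalarization of a vector-valued return (assumed form)."""
    return float(np.dot(weights, vector_return))

rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0])                 # hypothetical belief favoring objective 0
log_sigma = np.array([-1.0, -1.0])        # hypothetical belief uncertainty
w = sample_preference(mu, log_sigma, rng)
g = np.array([1.0, -0.5])                 # hypothetical two-objective return
print(w, scalarize(g, w))
```

In a full agent, `mu` and `log_sigma` would be produced by an inference network from recent interaction history, and the sampled `w` would be passed to the policy as an additional input.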

Paper Structure (43 sections, 33 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Adaptive value preference adjustment in a queueing scenario. At early stages ($t_1$), the agent prioritizes morality and chooses to wait. As the deadline approaches ($t_2$), preferences between morality (M) and energy (E) become balanced. When time is nearly exhausted ($t_3$), energy becomes dominant and the agent rationalizes cutting in line, illustrating dynamic reweighting of values under changing pressures.
  • Figure 2: Two-stage cognitive-inspired decision framework. History states are transformed into latent preferences via Value Appraisal, which in turn guide Action Selection. The resulting policy drives environment execution, forming a dynamic decision pipeline analogous to human appraisal–action coupling.
  • Figure 3: Post-shift performance (PS@K) on Queue environment.
  • Figure 4: Event-aligned trajectories in Maze environment. After each event, DPI updates its preferences and modifies its behavior in a contextually appropriate way: (a) it prioritizes shorter routes under a deadline shock; (b) it shows increased avoidance under a hazard surge; (c) it prefers waiting and selecting minimal-cost routes under an energy drought. Arrows indicate agent motion; shaded regions mark environmental hazards or costs. (d) Alignment between inferred preferences and reward vectors. DPI maintains positive cosine similarity and sharply increases alignment after event onsets, whereas baselines remain near zero or negative, indicating that only DPI learns a value representation that tracks task semantics.
  • Figure 5: Ablation study results. (a) Post-Shift Performance (PS@K) curves over the first $K=8$ steps after each event. (b) Multi-step average PS@K, summarizing short-term recovery into a single metric for each method. (c) Mean episodic return (MER) as a function of history window size $H$. Across all plots, our full DPI agent consistently outperforms ablations, confirming the necessity of KL regularization, directional alignment, and self-consistency, and showing that performance is robust to $H$ beyond a small temporal context.
  • ...and 5 more figures
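
The alignment measure in Figure 4(d) is described as a cosine similarity between inferred preferences and reward vectors. A minimal sketch of such a measure is given below; the paper's exact definition is not shown in this excerpt, so the function name `cosine_alignment`, the `eps` guard, and the example vectors are assumptions for illustration.

```python
import numpy as np

def cosine_alignment(w, r, eps=1e-8):
    """Cosine similarity between a preference vector w and a reward vector r."""
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    # eps guards against division by zero when either vector is all zeros
    return float(np.dot(w, r) / (np.linalg.norm(w) * np.linalg.norm(r) + eps))

# Hypothetical two-objective example
aligned = cosine_alignment([0.8, 0.2], [1.0, 0.1])    # preferences track the reward
opposed = cosine_alignment([0.8, 0.2], [-1.0, 0.0])   # preferences oppose the reward
print(aligned, opposed)
```

Values near 1 indicate that the inferred preferences point in the same direction as the reward vector, values near 0 indicate no relationship, and negative values indicate opposition.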