Identifiable Latent Bandits: Leveraging observational data for personalized decision-making

Ahmet Zahid Balcıoğlu; Newton Mwai; Emil Carlsson; Fredrik D. Johansson

TL;DR

The paper tackles the sample inefficiency of online bandits in personalized decision-making by introducing Identifiable Latent Bandits (ILB), which learn a latent state $Z$ that governs rewards across problem instances from historical observational data. It builds a two-stage offline-online framework. Offline, an identifiable latent variable model (LVM) is learned via mean-contrastive learning inspired by nonlinear ICA, recovering $g^{-1}$ and $\theta$. Online, the learned LVM is used to infer $\hat{z}_t$ and select actions with CPG, FPG, or FPG-TS. The authors prove partial identifiability up to an affine transform and show that, under linear reward means, the reward model and decision criteria can be identified from observational data, enabling more sample-efficient personalized decisions. Empirically, ILB approaches outperform fully online baselines and a regression baseline in synthetic and semi-synthetic Alzheimer's disease environments, with hybrid methods offering robustness under model misspecification and latent noise. The work highlights the potential of leveraging historical data for rapid personalization, while outlining key limitations and avenues for extending the identifiability results and handling time-varying latent structure.
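The claim that affine identifiability suffices under linear reward means can be checked numerically. The sketch below is not the paper's mean-contrastive learning procedure, nor its CPG/FPG algorithms. It is a minimal illustration with made-up quantities: the dimensions, the random `theta`, the affine map `(B, b)`, and noise-free mean rewards are all assumptions. It shows that a linear model refit on affinely transformed latents reproduces the same mean rewards, and therefore the same greedy actions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, n = 3, 4, 500  # hypothetical sizes, chosen for illustration

# Hypothetical ground truth: latent states z and per-action linear reward weights.
Z = rng.normal(size=(n, d))
theta = rng.normal(size=(n_actions, d))
mean_rewards = Z @ theta.T  # noise-free mean reward of each action, shape (n, n_actions)

# Assume the latents are recovered only up to an invertible affine map:
# z_hat = B z + b. B = A A^T + I is symmetric positive definite, hence invertible.
A = rng.normal(size=(d, d))
B = A @ A.T + np.eye(d)
b = rng.normal(size=d)
Z_hat = Z @ B.T + b

# Refit a linear model with intercept on the transformed latents.
design = np.hstack([Z_hat, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(design, mean_rewards, rcond=None)
pred = design @ coef

# The refit model matches the true mean rewards, so greedy actions coincide.
max_err = np.abs(pred - mean_rewards).max()
same_actions = np.array_equal(pred.argmax(axis=1), mean_rewards.argmax(axis=1))
print(f"max abs error: {max_err:.2e}, same greedy actions: {same_actions}")
```

The intercept column matters: the offset $b$ turns a purely linear reward in $z$ into an affine one in $\hat{z}$, so a refit without an intercept would generally not recover the rewards exactly.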

Abstract

Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.

Paper Structure (50 sections, 6 theorems, 68 equations, 15 figures, 4 tables, 1 algorithm)

Introduction
Contributions.
Problem setup
Additional related work
Identifiable latent bandits
Identifying assumptions on the data-generating process
How strong are the assumptions on $g$?
Offline stage: Identifying and estimating the latent variable model
Identifiability of reward model and decision-making criteria
Online stage: Estimation of the latent state & decision making
Experiments
Environments
LVM-based algorithms
Bandit baselines
Regression baseline
...and 35 more sections

Key Result

Theorem 3.3

Under Assumptions asmp:SEM through asm:learning, in the limit of infinite per-instance data, the optimal feature extractor $f^\star$ according to the objective eq:obj equals the inverse emission function $g^{-1}$ up to an invertible affine transformation. In other words, $f^\star(x) = B g^{-1}(x) + b$ for a constant invertible matrix $B \in \mathbb{R}^{d \times d}$ and a vector $b \in \mathbb{R}^d$.
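A short worked derivation shows why this affine ambiguity is harmless for decision-making when reward means are linear in the latent state. The specific form $\mu_a(z) = \theta_a^\top z$ and the symbols $\tilde{\theta}_a$, $c_a$ are assumptions made here for illustration; the paper's exact reward parameterization may differ.

```latex
% Assumed: the recovered latent is an invertible affine image of the true one,
% and the mean reward of action a is linear in z.
\begin{align*}
\hat{z} &= f^\star(x) = B\, g^{-1}(x) + b = B z + b,
  \qquad z := g^{-1}(x), \\
\mu_a(z) &= \theta_a^\top z
  = \theta_a^\top B^{-1} (\hat{z} - b)
  = \tilde{\theta}_a^\top \hat{z} + c_a, \\
\tilde{\theta}_a &:= B^{-\top} \theta_a,
  \qquad c_a := -\theta_a^\top B^{-1} b, \\
\arg\max_a \mu_a(z) &= \arg\max_a \left( \tilde{\theta}_a^\top \hat{z} + c_a \right).
\end{align*}
```

The mean reward therefore stays affine in $\hat{z}$, so a reward model fitted on the recovered latents ranks actions exactly as one fitted on the true latents would.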

Figures (15)

Figure 1: Identifying the best treatment for a new patient using ILB. Offline, we learn a provably identifiable latent variable model (LVM) (see the theorems labeled thm:id_latent and thm:reward), which previous latent bandit algorithms assume to be known a priori. Online, we apply a decision-making algorithm that makes use of the LVM (see the algorithm labeled alg:greedy).
Figure 2: The structural causal model of the assumption labeled asmp:SEM for an example patient instance $i$. Dashed arrows indicate potential sources of confounding bias that our model can handle.
Figure 3: Cumulative regret results for ADCB, comparing ILB decision-making algorithms to baselines. Error bars indicate one standard error computed with 200 seeds. The LVMs are fitted across $I=100$ instances with $T_o = 200$ points each, using a model with $L=2$ layers.
Figure 4: Cumulative regret for the Synthetic environment (left), comparing ILB decision-making algorithms to baselines, and comparative performance of our algorithm under different exponential noise (see the appendix labeled app:eta_dist for details). Error bars represent one standard error computed from 200 seeds. The LVMs are fitted across $I=100$ instances with $T_i = 200$ time points each, using a model with $L=2$ layers.
Figure 5: Cumulative regret for out-of-distribution experiments with increased $\Delta z$ difference from the training distribution on the synthetic data. Error bars indicate standard error over 200 seeds.
...and 10 more figures

Theorems & Definitions (11)

Theorem 3.3: Identifiability of inverse emission function
Theorem 3.4
Theorem 3.5
Definition B.1: Identifiability of LVM
Definition B.2: Affine Identifiability
Lemma B.3: Hyvarinen2016
proof
Lemma D.1: Estimator for FPG
proof
proof
...and 1 more
