Missing Data Multiple Imputation for Tabular Q-Learning in Online RL

Kyla Chasalow, Skyler Wu, Susan Murphy

TL;DR

This work addresses the challenge of missing data in online tabular Q-learning by introducing online imputation ensembles that run multiple imputations in parallel to represent uncertainty and maintain computational efficiency. Imputations are generated online via learned transition models, updated with fractional learning, and combined through voting-based action selection. The Grid World experiments show that multiple imputation variants, especially with larger pathway counts (e.g., $K=10$), can outperform simple baselines and single imputation across MCAR, MCOLOR, and MFOG missingness mechanisms, though NMAR can introduce challenges and synthetic updates may be risky in such settings. The findings suggest that imputation ensembles are a promising, scalable framework for online RL with missing data, with broader implications for real-time decision-making under partial observability and uncertainty.

Abstract

Missing data in online reinforcement learning (RL) poses additional challenges relative to missing data in standard tabular datasets or in offline policy learning. Because the agent must impute and act at each time step, imputation cannot be put off until enough data exist to produce stable imputation models. It also means that future data collection and learning depend on previous imputations. This paper proposes fully online imputation ensembles. We find that maintaining multiple imputation pathways may help balance the need to capture uncertainty under missingness against the need for efficiency in online settings. We consider multiple approaches for incorporating these pathways into learning and action selection. Using a Grid World experiment with various types of missingness, we provide preliminary evidence that multiple imputation pathways may be a useful framework for constructing simple and efficient online missing data RL methods.
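To make the idea of imputation pathways concrete, here is a minimal Python sketch of how K pathways might each impute a missing state, vote on an action, and apply a tabular Q-learning update. It is an illustration under stated assumptions, not the paper's implementation: the function names, the count-based transition model, the tie-breaking rule, and the `weight` argument (one possible reading of a fractional update) are all hypothetical.

```python
import random
from collections import Counter


def sample_imputation(counts, s_prev, a, rng):
    """Draw an imputed next state from a count-based transition model.

    `counts` maps (state, action) to a dict of next-state counts.
    Falling back to `s_prev` when no counts exist is an assumption.
    """
    outcomes = counts.get((s_prev, a))
    if not outcomes:
        return s_prev
    states = list(outcomes)
    weights = [outcomes[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def vote_action(q_tables, imputed_states, actions, epsilon, rng):
    """Epsilon-greedy selection by majority vote across the K pathways."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    votes = Counter()
    for q, s in zip(q_tables, imputed_states):
        # Each pathway votes for the greedy action at its own imputed state.
        best = max(actions, key=lambda a: q.get((s, a), 0.0))
        votes[best] += 1
    top = max(votes.values())
    # Ties are broken by the order of `actions` (an arbitrary choice).
    return next(a for a in actions if votes.get(a, 0) == top)


def q_update(q, s, a, r, s_next, actions, alpha, gamma, weight=1.0):
    """Standard tabular Q-learning step, scaled by a hypothetical `weight`."""
    target = r + gamma * max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + weight * alpha * (target - old)
```

In this sketch each pathway keeps its own Q-table and imputed state, so disagreement among the votes reflects uncertainty about the missing values; setting `weight` below 1 damps updates that rely on imputed rather than observed states.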

Paper Structure

This paper contains 41 sections, 8 theorems, 43 equations, 11 figures, 2 tables.

Key Result

Theorem 1.1

Let $X_1,\dots,X_t \overset{iid}{\sim} \text{Bernoulli}(p)$ with $\sigma^2 := p(1-p)$, let $p_o$ be known, and let $\hat{p}_j$ be the average of $X_1,\dots,X_j$, with variance $\frac{\sigma^2}{j}$. Define: Then as $t\rightarrow\infty$, $\frac{V(\tilde{p}_t)}{V(\hat{p}_t)} \rightarrow \infty$.

Figures (11)

  • Figure 1: Pathways of imputations with $S_1$ and $S_4$ fully observed. Past imputations affect future imputations, though not the observed parts of future states. Actions, which also affect and are affected by states and imputations, are not depicted.
  • Figure 2: Illustration of Grid Worlds. From left to right: neither flooding nor fog; flooding without fog; fog without flooding; both flooding and fog. The "water" area is outlined in bold blue. The dark green states outlined in black are the start (left) and terminal (right) states. Regarding fog: the 3x3 white region in the upper-right corner of the third and fourth panels indicates the presence of fog; states enshrouded in the fog have a higher probability of missingness under MFOG. To clarify, "white" is not a possible value for $S_{t,3}$. The underlying $S_{t,3}$ values of all states in the third and fourth panels are the same as the corresponding values shown in the first and second panels, respectively.
  • Figure 3: Top: Comparison of mean performance over 50000 timesteps for all methods under MCAR with increasing missingness. Bottom: Performance over time for $\theta=0.4$. The y-axis in each case is the cumulative mean per episode at time $t$ of the given metric. The environment is set to $P(\text{wind}) = 0.1$, $P(\text{flood\ transition}) = 0.1$. Each method is shown for its best $\epsilon,\alpha,\gamma$, and action space option (stay in place allowed or not). Lines represent an average over 5 trials. Note the log scale for the center and right plots. The dashed blue (or red) line represents our multiple (or single) imputation model with "conservative" learning of the transition model for imputations.
  • Figure 4: Top: Performance over time for MCOLOR with missingness rates of $0.2$ (when in green), $0.4$ (when in orange), and $0.6$ (when in red) for all three of $x,y$ and color. Bottom: Performance over time for MFOG with a missingness rate of $0.5$ inside the fog region and $0$ outside. The y-axis in each case is the cumulative mean per episode at time $t$ of the given metric. The environment is set to $P(\text{wind}) = 0.1$, $P(\text{flood\ transition}) = 0.1$. Each method is shown for its best $\epsilon,\alpha,\gamma$, and action space option (stay in place allowed or not). Lines represent an average over 5 trials. Note the log scale for the center and right plots.
  • Figure 5: Magnitude of $\sum_{j=2}^{K} (-1)^{j-1}c_{j,K} r^j$ over a range of $K$ with $r=1/K$.
  • ...and 6 more figures
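
The missingness mechanisms named in the captions above can be sketched in Python as follows. This is a hedged illustration rather than the paper's simulator: it uses the rates quoted in the captions (a single rate $\theta$ for MCAR; 0.2, 0.4, and 0.6 by color for MCOLOR; 0.5 inside the fog and 0 outside for MFOG), while the function names, the `(x, y, color)` state layout, the fog cell coordinates, and the choice to mask each component independently are assumptions.

```python
import random

# Per-color missingness rates quoted in the Figure 4 caption.
MCOLOR_RATES = {"green": 0.2, "orange": 0.4, "red": 0.6}


def missing_prob(mechanism, state, theta=0.4, fog_cells=frozenset(), fog_rate=0.5):
    """Return the per-component missingness probability for a state.

    `state` is assumed to be a tuple (x, y, color). `fog_cells` is a
    hypothetical set of (x, y) coordinates covered by fog.
    """
    x, y, color = state
    if mechanism == "MCAR":
        return theta  # same rate everywhere
    if mechanism == "MCOLOR":
        return MCOLOR_RATES[color]  # rate depends on the state's color
    if mechanism == "MFOG":
        return fog_rate if (x, y) in fog_cells else 0.0
    raise ValueError(f"unknown mechanism: {mechanism}")


def observe(state, p, rng):
    """Mask each component independently with probability p (an assumption)."""
    return tuple(None if rng.random() < p else v for v in state)
```

For example, `observe((2, 3, "red"), missing_prob("MCOLOR", (2, 3, "red")), random.Random(0))` returns the state with each of its three components replaced by `None` with probability 0.6.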

Theorems & Definitions (16)

  • Theorem 1.1
  • proof: Proof of Theorem 1.1
  • Theorem 1.2
  • proof
  • Theorem 1.3
  • proof
  • Theorem 1.4
  • proof
  • Theorem 1.5
  • proof
  • ...and 6 more