The Indoor-Training Effect: unexpected gains from distribution shifts in the transition function

Serena Bono; Spandan Madan; Ishaan Grover; Mao Yasueda; Cynthia Breazeal; Hanspeter Pfister; Gabriel Kreiman

TL;DR

The paper investigates how distribution shifts in transition dynamics affect reinforcement learning generalization, introducing Noise Injection to create δ-environments around a target MDP $M_T$. By comparing a Learnability agent trained and tested on the δ-environment $M_\delta$ with a Generalization agent trained on $M_T$ and tested on $M_\delta$, the authors reveal the Indoor-Training Effect: in many cases, training in the noise-free environment yields better test performance under noise. This counterintuitive finding holds across 60 MDP variations in three ATARI domains (PacMan, Pong, Breakout) and extends to semantic and non-semantic changes, with exploration patterns predicting the performance gap. The results persist in deep RL (DQN), suggesting that simple, controlled training environments can foster more robust policies for noisy deployment and have implications for robotics and transfer learning. Limitations include the Atari-focused domain and classical RL; future work could extend to real-world settings and broader DRL algorithms.
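The comparison between the two agents can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the paper: $\mathcal{A}(M)$ denotes the policy returned by a learning algorithm trained on MDP $M$, and $J_{M}(\pi)$ denotes the expected return of policy $\pi$ when evaluated on $M$.

$$
\pi_L = \mathcal{A}(M_\delta), \qquad \pi_G = \mathcal{A}(M_T), \qquad \Delta = J_{M_\delta}(\pi_G) - J_{M_\delta}(\pi_L)
$$

Under this notation, $\pi_L$ is the Learnability agent, $\pi_G$ is the Generalization agent, and the Indoor-Training Effect corresponds to the cases where $\Delta > 0$.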

Abstract

Is it better to perform tennis training in a pristine indoor environment or a noisy outdoor one? To model this problem, here we investigate whether shifts in the transition probabilities between the training and testing environments in reinforcement learning problems can lead to better performance under certain conditions. We generate new Markov Decision Processes (MDPs) starting from a given MDP, by adding quantifiable, parametric noise into the transition function. We refer to this process as Noise Injection and the resulting environments as δ-environments. This process allows us to create variations of the same environment with quantitative control over noise serving as a metric of distance between environments. Conventional wisdom suggests that training and testing on the same MDP should yield the best results. In stark contrast, we observe that agents can perform better when trained on the noise-free environment and tested on the noisy δ-environments, compared to training and testing on the same δ-environments. We confirm that this finding extends beyond noise variations: it is possible to showcase the same phenomenon in ATARI game variations including varying Ghost behaviour in PacMan, and Paddle behaviour in Pong. We demonstrate this intriguing behaviour across 60 different variations of ATARI games, including PacMan, Pong, and Breakout. We refer to this phenomenon as the Indoor-Training Effect. Code to reproduce our experiments and to implement Noise Injection can be found at https://bit.ly/3X6CTYk.
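As a rough illustration of Noise Injection, the sketch below perturbs a tabular transition function with Gaussian noise and turns each row back into a probability distribution. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `inject_noise`, the `(S, A, S)` array layout, the reading of $\delta$ as the noise standard deviation, and the clip-then-renormalize step are all hypothetical choices made for this example.

```python
import numpy as np


def inject_noise(T, delta, rng=None):
    """Return a noisy copy of a tabular transition function.

    T has shape (S, A, S), with T[s, a, s2] = P(s2 | s, a).
    delta is used as the standard deviation of the Gaussian noise
    (an assumption made for this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = T + rng.normal(loc=0.0, scale=delta, size=T.shape)
    # Probabilities cannot be negative, so clip before renormalizing.
    noisy = np.clip(noisy, 0.0, None)
    totals = noisy.sum(axis=-1, keepdims=True)
    # If every entry of a row was clipped to zero, keep the original row.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.where(totals > 0, noisy / safe_totals, T)


# Deterministic toy MDP: action 0 always leads to state 0, action 1 to state 2.
T_target = np.zeros((3, 2, 3))
T_target[:, 0, 0] = 1.0
T_target[:, 1, 2] = 1.0
T_delta = inject_noise(T_target, delta=0.1, rng=np.random.default_rng(0))
```

In this sketch, entries that are zero in `T_target` can become positive in `T_delta`, which mirrors the idea that transitions impossible in the noise-free environment become possible in a δ-environment. With `delta=0.0` the function returns the original transition function unchanged.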

Paper Structure (23 sections, 4 equations, 58 figures, 1 table)

Introduction
Related Works
Experimental Details
Results
Domains
PacMan
Pong
Breakout
Training Parameters
Additional Graphs Non-Semantic Variations
PacMan
Pong
Breakout
Additional Graphs Semantic variations
PacMan
...and 8 more sections

Figures (58)

Figure 1: ATARI games modified with Noise Injection. (a) In the original Target Environment ($\mathcal{M}_T$), when the agent (PacMan) takes the action right, PacMan moves right with probability $1.0$. Noise Injection allows us to create multiple worlds in the vicinity of this environment by adding controlled Gaussian noise ($\delta$) to the original Transition Function ($T$). When the agent takes the action right in these $\delta$-environments, with a low probability the game may transition to a state which would not be possible in non-noisy PacMan. For brevity, we refer to these as non-standard transitions: they have probability $0$ in the original Target Environment but are now possible. Experiments with Noise Injection are presented on three ATARI games: (b) PacMan, (c) Pong, and (d) Breakout. (e) We compare two agents on these environments: a Learnability agent trained and tested on the same environment ($\mathcal{M}_\delta$), and a Generalization agent trained on a different MDP ($\mathcal{M}_T$) and tested on $\mathcal{M}_\delta$.
Figure 2: Schematic illustration of variations for PacMan. (a) Game dynamics when the agent picks the action right in a standard, non-noisy MDP for the v3 grid. The ghosts' actions follow a uniform probability distribution over possible moves; here the ghost moves up or right with an equal probability of $0.5$. This is referred to as a RandomGhost. (b) Grid variations for PacMan: v2, v3, and v4. These grids vary in size, positions of walls, and positions of food pellets. v2, v3, and v4 are designed to be increasingly harder. (c) Semantic variations, in which there is a meaningful change in the distribution of game elements. Here, a FollowingGhost is depicted, which has a higher probability ($0.8$) of taking a move that brings it closer to PacMan. (d) Noise-injected MDP generated by adding Gaussian noise to the standard transition function. Alongside states reachable by the ghost taking a legal move, non-standard transitions now become possible, which result in the game reaching states that are otherwise unreachable.
Figure 3: Generalization agents can outperform Learnability agents. Results for the PacMan v4 grid, reporting mean reward as a function of episode number. (a) SARSA agent trained with a Boltzmann exploration strategy for Target MDPs generated with both high-level (solid line) and low-level (line with 'x' markers) noise injection. The Generalization Agent (red) beats the Learnability Agent (green) (two-sided t-test, $p<0.001$). (b) The same result holds for a SARSA agent trained with the $\epsilon$-greedy exploration strategy. This finding also holds for Q-Learning agents trained with (c) Boltzmann and (d) $\epsilon$-greedy exploration strategies. Standard deviation across the 500 agents is reported as the error bar in all figures; however, it is too small for these error bars to be visible.
Figure 4: Generalization can outperform Learnability across multiple variations of PacMan. Format and conventions as in Fig. 3. (a) Agents trained on the PacMan v2 grid with the Ghost dynamics set to the RandomGhost setting. (b) Agents trained on v2 with a DirectionalGhost with $p=0.3$. (c) DirectionalGhost with $p=0.6$. (d), (e), (f) Variations with the v3 grid with RandomGhost, DirectionalGhost ($p=0.3$), and DirectionalGhost ($p=0.6$), respectively. All experiments are shown for SARSA agents trained with the $\epsilon$-greedy exploration strategy. Generalization agents consistently beat Learnability agents (two-sided t-test, $p<0.001$). Corresponding results for agents trained with SARSA and the Boltzmann exploration strategy, and for Q-Learning with both $\epsilon$-greedy and Boltzmann exploration strategies, are shown in Figures \ref{fig:atari_variations-pacman-sarsa-boltzmann}-\ref{fig:atari_variations-pacman-qlearning-egreedy}.
Figure 5: Generalization agents outperform Learnability agents on Pong and Breakout as well. Format and conventions as in Fig. 3. Performance of SARSA agents trained with an $\epsilon$-greedy exploration strategy on (a) Pong p1 grid, (b) Pong p2 grid, (c) Breakout b1 grid, and (d) Breakout b2 grid. The Generalization Agent consistently beats the Learnability Agent (two-sided t-test, $p<0.001$).
...and 53 more figures
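
The ghost behaviours described in the Figure 2 caption can be sketched as a function that maps each legal move to a probability. This is a hedged illustration rather than the authors' code: the function name `ghost_move_distribution`, the `p_closer` parameter, the use of Manhattan distance, and the rule for sharing probability mass among several moves are all assumptions made for this example.

```python
def ghost_move_distribution(ghost, pacman, legal_moves, p_closer=None):
    """Map each legal move (dx, dy) to its probability.

    p_closer=None gives a uniform distribution (RandomGhost).
    Otherwise a total mass of p_closer is shared among the moves that
    reduce the Manhattan distance to PacMan, and the remaining mass is
    shared among the other moves (an assumed FollowingGhost rule).
    legal_moves is assumed to contain no duplicates.
    """
    if not legal_moves:
        raise ValueError("the ghost needs at least one legal move")
    uniform = {m: 1.0 / len(legal_moves) for m in legal_moves}
    if p_closer is None:
        return uniform

    def distance(pos):
        return abs(pos[0] - pacman[0]) + abs(pos[1] - pacman[1])

    current = distance(ghost)
    closer = [m for m in legal_moves
              if distance((ghost[0] + m[0], ghost[1] + m[1])) < current]
    other = [m for m in legal_moves if m not in closer]
    # If no move (or every move) approaches PacMan, fall back to uniform.
    if not closer or not other:
        return uniform
    probs = {m: p_closer / len(closer) for m in closer}
    probs.update({m: (1.0 - p_closer) / len(other) for m in other})
    return probs


moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
random_ghost = ghost_move_distribution((0, 0), (3, 0), moves)
following_ghost = ghost_move_distribution((0, 0), (3, 0), moves, p_closer=0.8)
```

Here `random_ghost` assigns equal probability to every legal move, while `following_ghost` places probability $0.8$ on the single move that approaches PacMan and splits the remainder across the other three moves.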
