Investigating Pre-Training Objectives for Generalization in Vision-Based Reinforcement Learning

Donghu Kim; Hojoon Lee; Kyungmin Lee; Dongyoon Hwang; Jaegul Choo

TL;DR

The paper examines how pre-training objectives shape generalization in vision-based RL under distribution shift. It introduces Atari-PB, a unified benchmark that pre-trains a ResNet-50 on 10 million transitions from 50 Atari games and evaluates it in In-Distribution (ID), Near-Out-of-Distribution (Near-OOD), and Far-Out-of-Distribution (Far-OOD) environments. The compared objectives are grouped by the data they learn from: images, videos, demonstrations, and trajectories. Task-agnostic pre-training, which captures spatial and temporal structure, consistently improves generalization across all three distributions. Task-specific pre-training, which learns from demonstrations or rewards, helps in ID and Near-OOD environments but often harms Far-OOD performance; trajectory-based pre-training yields strong ID results. The results highlight the value of temporal dynamics for generalization and suggest future architectures that decouple task-agnostic from task-specific features to improve downstream RL across diverse environments.

Abstract

Recently, various pre-training methods have been introduced in vision-based Reinforcement Learning (RL). However, their generalization ability remains unclear because evaluations have been limited to in-distribution environments and experimental setups have not been unified. To address this, we introduce the Atari Pre-training Benchmark (Atari-PB), which pre-trains a ResNet-50 model on 10 million transitions from 50 Atari games and evaluates it across diverse environment distributions. Our experiments show that pre-training objectives focused on learning task-agnostic features (e.g., identifying objects and understanding temporal dynamics) enhance generalization across different environments. In contrast, objectives focused on learning task-specific knowledge (e.g., identifying agents and fitting reward functions) improve performance in environments similar to the pre-training dataset but not in varied ones. We release our code, datasets, and model checkpoints at https://github.com/dojeon-ai/Atari-PB.

Paper Structure (51 sections, 4 equations, 12 figures, 33 tables)

Introduction
Related Work
Evaluating Generalization in Vision-Based RL
Pre-training for Generalization in Vision-based RL
Preliminaries
Reinforcement Learning
Pre-training for Reinforcement Learning
Algorithms
No pre-training
Learning from Image
Learning from Video
Learning from Demonstration
Learning from Trajectory
Atari Pre-training Benchmark
Dataset
...and 36 more sections

Figures (12)

Figure 1: Overview of Atari-PB. The ResNet-50-based model is pre-trained on 10M interactions with a given pre-training algorithm. The pre-trained model is then evaluated by fine-tuning it on In-Distribution (ID), Near-Out-of-Distribution (Near-OOD), and Far-Out-of-Distribution (Far-OOD) environments.
Figure 2: Results Overview. The pre-training methods are evaluated by their performance after fine-tuning to environments in three groups: In-Distribution, Near-Out-of-Distribution, and Far-Out-of-Distribution. Here, we report the results of fine-tuning via behavior cloning (i.e., replicating expert behavior) and average the scores of each algorithm category for a comprehensive analysis.
Figure 3: Experimental Setup. The model is pre-trained with 50 Atari games in a multi-headed fashion (left), then fine-tuned for each game individually (right). The snowflake symbol indicates freezing the weights, whereas the fire symbol represents re-initializing and fine-tuning the component.
Figure 4: Main Results. Performance of each pre-training method after fine-tuning, across distributions (ID, Near-OOD, Far-OOD) and adaptation scenarios (Offline BC, Online RL). We report the interquartile mean (IQM) of normalized scores across three seeds, along with a 95% confidence interval. The bars are grouped and color-coded by the categories described in Section \ref{subsection:method_class} for ease of viewing.
Figure 5: Qualitative analysis of methods. EigenCAM visualization of the pre-trained backbones in 3 games: SpaceInvaders (ID), Assault (Near-OOD), and Surround (Far-OOD). The agents are marked in red circles for each game (first column). We chose one representative method for each algorithm class.
...and 7 more figures
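
As a rough illustration of the IQM statistic reported in Figure 4, the sketch below averages the middle 50% of a set of normalized scores after discarding the lowest and highest quarters. The function name `interquartile_mean`, the example scores, and the floor-based trimming convention are assumptions made for illustration. The paper's exact normalization and aggregation procedure is not given in this summary.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of `scores` (lowest and highest 25% discarded)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("scores must be non-empty")
    # Assumed convention: trim floor(0.25 * n) values from each end.
    k = int(n * 0.25)
    middle = ordered[k:n - k]
    return sum(middle) / len(middle)


# Hypothetical normalized scores (e.g., 3 seeds x 4 games, flattened).
# The last value is a deliberate outlier.
example_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60,
                  0.70, 0.80, 0.90, 1.00, 1.10, 5.00]

iqm = interquartile_mean(example_scores)
plain_mean = sum(example_scores) / len(example_scores)
print(f"IQM: {iqm:.3f}, plain mean: {plain_mean:.3f}")
```

In this made-up example the single outlier pulls the plain mean well above the IQM, which is the usual motivation for preferring IQM when a few games or seeds dominate the aggregate.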
