InDRiVE: Reward-Free World-Model Pretraining for Autonomous Driving via Latent Disagreement

Feeza Khan Khanzada, Jaerock Kwon

TL;DR

The paper addresses the dependence on task-specific rewards in autonomous driving by proposing InDRiVE, a reward-free pretraining framework that uses latent ensemble disagreement as the sole intrinsic signal to train a DreamerV3-based world model. It introduces a strict two-phase transfer protocol: zero-shot evaluation with all parameters frozen in unseen towns, followed by a 10k-step few-shot adaptation to lane following and collision avoidance. Disagreement is compared against ICM (Intrinsic Curiosity Module) and RND (Random Network Distillation) baselines under identical backbones. Experiments in CARLA show that latent disagreement yields stronger zero-shot robustness and robust few-shot collision avoidance, particularly under town shift, which supports intrinsic exploration as a route to reusable driving representations. These results suggest that reward-free pretraining can reduce reliance on manual reward engineering while still allowing rapid adaptation to new driving scenarios and safety-critical tasks.

Abstract

Model-based reinforcement learning (MBRL) can reduce interaction cost for autonomous driving by learning a predictive world model, but it typically still depends on task-specific rewards that are difficult to design and often brittle under distribution shift. This paper presents InDRiVE, a DreamerV3-style MBRL agent that performs reward-free pretraining in CARLA using only intrinsic motivation derived from latent ensemble disagreement. Disagreement acts as a proxy for epistemic uncertainty and drives the agent toward under-explored driving situations, while an imagination-based actor-critic learns a planner-free exploration policy directly from the learned world model. After intrinsic pretraining, we evaluate zero-shot transfer by freezing all parameters and deploying the pretrained exploration policy in unseen towns and routes. We then study few-shot adaptation by training a task policy with limited extrinsic feedback for downstream objectives (lane following and collision avoidance). Experiments in CARLA across towns, routes, and traffic densities show that disagreement-based pretraining yields stronger zero-shot robustness and robust few-shot collision avoidance under town shift and matched interaction budgets, supporting the use of intrinsic disagreement as a practical reward-free pretraining signal for reusable driving world models.

Paper Structure

This paper contains 33 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of InDRiVE. During reward-free pretraining, the agent collects data using an exploration policy optimized only with intrinsic reward (LD/ICM/RND). The world model is trained from replay using reconstruction, dynamics, and continuation losses; no task reward is computed or used. Zero-shot transfer freezes all parameters and evaluates the pretrained exploration policy in unseen towns/routes. Few-shot fine-tuning then introduces task-specific rewards and trains a task policy for a small interaction budget; the dynamics model is kept frozen unless stated otherwise.
  • Figure 2: Overview of InDRiVE. (a) An actor-critic policy architecture that uses latent disagreement (LD, detailed in (b)) for exploration. Raw images are encoded into a stochastic latent $s_t$, which is combined with the deterministic hidden state $h_t$ to maintain temporal context. The actor-critic policy then outputs an action $a_t$ based on $u_t = [s_t, h_t]$. (b) An ensemble of forward models predicts candidate next states $\hat{s}_{t+1}^{\,k}$ for the same $(s_t, a_t)$. The variance among these predictions yields a latent-disagreement (intrinsic) reward, which encourages the policy to explore.
  • Figure 3: Evaluation routes in CARLA for transfer tests. Top-down maps for (a) Town01 (seen) and (b) Town02 (unseen). Colored trajectories denote the four fixed routes: Straight (green), Right-Turn loop (blue), Left-Turn loop (yellow), and Two-Turn loop (red); arrows indicate travel direction. Each route is evaluated under traffic densities of 5, 10, and 20 vehicles within 150 m, and results are averaged across seeds.
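
The latent-disagreement reward described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `latent_disagreement`, the linear ensemble heads, the tensor sizes, and the choice to average the per-dimension variance over the latent dimensions are all assumptions made for illustration. The caption only states that the reward comes from the variance among ensemble predictions of the next latent state for the same $(s_t, a_t)$.

```python
import numpy as np


def latent_disagreement(preds: np.ndarray) -> np.ndarray:
    """Intrinsic reward from ensemble disagreement (illustrative sketch).

    preds has shape (K, B, D): K ensemble members, a batch of B
    (s_t, a_t) pairs, and D latent dimensions for the predicted s_{t+1}.
    Returns shape (B,): the variance across ensemble members, averaged
    over latent dimensions (the averaging is an assumption).
    """
    if preds.ndim != 3 or preds.shape[0] < 2:
        raise ValueError("expected shape (K, B, D) with K >= 2")
    return preds.var(axis=0).mean(axis=-1)


# Hypothetical toy ensemble: K linear forward models over [s_t, a_t].
rng = np.random.default_rng(0)
K, B, S, A = 5, 4, 8, 2  # ensemble size, batch, latent dim, action dim
weights = rng.normal(size=(K, S + A, S))  # one weight matrix per member

s_t = rng.normal(size=(B, S))
a_t = rng.normal(size=(B, A))
inputs = np.concatenate([s_t, a_t], axis=-1)  # (B, S + A)

# Every member predicts the next latent for the same (s_t, a_t).
preds = np.einsum("bi,kio->kbo", inputs, weights)  # (K, B, S)

intrinsic_reward = latent_disagreement(preds)  # (B,), non-negative
```

When all ensemble members agree, the reward is zero, so under this sketch the exploration policy is rewarded only where the forward models have not yet converged on a common prediction.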