InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model

Feeza Khan Khanzada, Jaerock Kwon

TL;DR

The paper tackles generalization under sparse rewards in autonomous driving by proposing InDRiVE, a Dreamer-based model-based RL method that relies solely on intrinsic, disagreement-based rewards from an ensemble world model. It develops a latent world model using a Recurrent State-Space Model and latent-disagreement rewards to drive exploration, followed by a two-phase training procedure that enables zero-shot or few-shot adaptation to downstream tasks like lane following and collision avoidance. Empirical results in CARLA show InDRiVE achieves higher success rates and fewer infractions than DreamerV2 and DreamerV3 baselines while using significantly fewer training steps, and demonstrates robust zero-shot transfer to unseen towns with rapid fine-tuning when needed. The work highlights the viability of fully intrinsic exploration for learning robust, scalable driving policies and points to broader implications for self-supervised, multi-task autonomous systems.

Abstract

Model-based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet existing solutions often rely on carefully crafted, task-specific extrinsic rewards, which limits generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration), a method that uses purely intrinsic, disagreement-based rewards within a Dreamer-based MBRL framework. By training an ensemble of world models, the agent actively explores high-uncertainty regions of environments without any task-specific feedback. This approach yields a task-agnostic latent representation, allowing for rapid zero-shot or few-shot fine-tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions than DreamerV2 and DreamerV3 baselines despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle control behaviors, paving the way for more scalable and adaptable autonomous driving systems.

Paper Structure

This paper contains 19 sections, 8 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of InDRiVE. (a) An actor-critic policy architecture that incorporates latent disagreement for exploration; LD denotes the latent disagreement computed in (b). Raw images are encoded into a stochastic latent $s_t$, which is combined with the deterministic hidden state $h_t$ to maintain temporal context. The actor-critic policy then outputs an action $a_t$ based on $[s_t, h_t]$. (b) An ensemble of forward models predicts potential next states $\hat{s}_{t+1}^{\,k}$ for the same $(s_t, a_t)$. The variance among these predictions yields a latent-disagreement (intrinsic) reward, which encourages the policy to explore.
  • Figure 2: Average reward rates of InDRiVE (red), DreamerV3 (blue), and DreamerV2 (green) across three CARLA driving tasks. The gray area after 500K steps indicates the start of InDRiVE's fine-tuning phase (few-shot learning). Despite being trained on the extrinsic reward for fewer steps (10K), InDRiVE converges to near-optimal performance in all three tasks, surpassing both Dreamer baselines, and demonstrates superior sample efficiency and training stability overall.
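
The latent-disagreement reward described in the Figure 1(b) caption can be illustrated with a minimal sketch. The caption states only that the variance among the ensemble's next-state predictions $\hat{s}_{t+1}^{\,k}$ yields the intrinsic reward; the function name `latent_disagreement_reward`, the use of population variance, and the averaging over latent dimensions are assumptions made here for illustration and may differ from the paper's exact formulation.

```python
from statistics import pvariance
from typing import Sequence


def latent_disagreement_reward(predictions: Sequence[Sequence[float]]) -> float:
    """Illustrative intrinsic reward from ensemble disagreement.

    predictions[k][d] is the k-th forward model's prediction of latent
    dimension d of the next state, all made for the same (s_t, a_t).
    Assumption: reward = population variance across ensemble members,
    averaged over latent dimensions.
    """
    if len(predictions) < 2:
        raise ValueError("need at least two ensemble members")
    dims = len(predictions[0])
    if dims == 0 or any(len(p) != dims for p in predictions):
        raise ValueError("all predictions must share one non-zero latent size")

    # Variance across the ensemble, computed separately per latent dimension.
    per_dim_variance = [
        pvariance([member[d] for member in predictions]) for d in range(dims)
    ]
    # Collapse to a scalar reward by averaging over latent dimensions.
    return sum(per_dim_variance) / dims


# Hypothetical predictions from a 3-member ensemble over a 2-D latent.
agree = [[0.5, -1.0], [0.5, -1.0], [0.5, -1.0]]
disagree = [[0.0, -1.0], [1.0, -1.0], [2.0, -1.0]]
print(latent_disagreement_reward(agree))     # no disagreement
print(latent_disagreement_reward(disagree))  # larger reward
```

When the ensemble members agree, the reward is zero, so the policy has no incentive to revisit well-modeled states; where they diverge, the reward grows, which is what pushes exploration toward high-uncertainty regions.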