Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming

Hany Hamed, Subin Kim, Dongyeong Kim, Jaesik Yoon, Sungjin Ahn

TL;DR

Dr. Strategy tackles the inefficiency of pixel-based model-based generalist RL by introducing strategic dreaming, a divide-and-conquer planning paradigm that uses latent landmarks to structure dreaming. The agent learns a discrete landmark representation via VQ-VAE and employs three specialized policies (a Highway policy for traveling to landmarks, an Explorer for dreaming-driven exploration, and an Achiever for goal attainment), alongside Focused Sampling to improve local precision. Dreaming occurs in a world model (RSSM), enabling efficient, zero-shot planning across visually complex, partially observable tasks, with strong results in 2D/3D navigation and competitive RoboKitchen performance. Overall, the work advances MBRL by coupling structured latent representations with modular, goal-directed policies, yielding improved sample efficiency and robust generalization to unseen goals, while opening avenues for adaptive landmark scaling and hierarchical planning enhancements.

Abstract

Model-based reinforcement learning (MBRL) has been a primary approach to ameliorating the sample efficiency issue as well as to make a generalist agent. However, there has not been much effort toward enhancing the strategy of dreaming itself. Therefore, it is a question whether and how an agent can "dream better" in a more structured and strategic way. In this paper, inspired by the observation from cognitive science suggesting that humans use a spatial divide-and-conquer strategy in planning, we propose a new MBRL agent, called Dr. Strategy, which is equipped with a novel Dreaming Strategy. The proposed agent realizes a version of divide-and-conquer-like strategy in dreaming. This is achieved by learning a set of latent landmarks and then utilizing these to learn a landmark-conditioned highway policy. With the highway policy, the agent can first learn in the dream to move to a landmark, and from there it tackles the exploration and achievement task in a more focused way. In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods in various visually complex and partially observable navigation tasks.
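
To make the idea of "latent landmarks" concrete, the following is a minimal sketch of the nearest-codebook lookup that a VQ-VAE-style quantizer performs. It is not the authors' implementation: the function name `quantize`, the 2-D toy codebook, and the use of Euclidean distance on raw tuples are all assumptions made for illustration. In the paper the landmarks live in the latent space of a learned world model.

```python
import math

def quantize(state, codebook):
    """Return (index, vector) of the codebook entry closest to `state`.

    Hypothetical helper: Euclidean distance over plain tuples stands in
    for whatever latent-space distance a trained VQ-VAE would use.
    """
    distances = [math.dist(state, code) for code in codebook]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]

# Toy codebook of three "landmarks" (made-up values, not from the paper).
codebook = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0)]

# A latent state is mapped to its nearest landmark.
index, landmark = quantize((4.2, 0.9), codebook)
print(index, landmark)
```

Each discrete index then serves as a landmark identifier that a landmark-conditioned policy can be trained to reach.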

Paper Structure

This paper contains 26 sections, 4 equations, 20 figures, 4 tables, 1 algorithm.

Figures (20)

  • Figure 1: (Left) In the real world, humans maintain a hierarchical spatial structure for easy navigation. (Right) Trying to memorize all the streets on the map can lead to an overwhelming amount of information, making it difficult to retain the information effectively. (Middle) In contrast, choosing to travel by train to move between cities and transfer to a taxi at the terminal minimizes the complexity, allowing one to concentrate on local routes starting from the terminal near the destination.
  • Figure 2: Comparison between Dr. Strategy and LEXA. (a) We construct latent landmarks and train the Highway policy $\pi_{l}(a_t|s_t, l)$, the Explorer $\pi_{e}(a_t|s_t)$, and the Achiever $\pi_g(a_t|s_t, e_g)$ in imagination. The Achiever is trained with Focused Sampling, which conditions on goals within a small number of steps instead of on randomly sampled goals. All three policies are trained purely on imagined trajectories from the world model. (b) During exploration, we evaluate only the landmarks and call the landmark with the highest exploration potential the "Curious Landmark" (C-Landmark). In the real environment, the Highway policy moves to the Curious Landmark, and the Explorer resumes exploration from there. The agent alternates between training and exploration with a certain frequency $T_F$. (c) At test time, we find the landmark nearest to the given pixel-level goal (G-Landmark). The Highway policy reaches the G-Landmark, and the Achiever proceeds to achieve the goal immediately after. The blue boxes in the bottom half of the figure indicate the modules of LEXA, which are the Explorer and the Achiever without focused sampling and landmarks.
  • Figure 3: Environments. We evaluate our agent across three different environments: 2D Navigation, 3D-Maze Navigation, and RoboKitchen. In these navigation environments, the agent's views are partially observable and visualized on the left. The top-left and bottom-left images represent the agent's initial view in the 2D and 3D Navigation settings, respectively. The second and third columns depict the top-down views of the 2D and 3D Navigation environments, respectively.
  • Figure 4: Zero-shot evaluation of the baselines across different environments. Each baseline is evaluated given a goal image from the environment's test set. Dr. Strategy significantly outperforms other baselines in most of the navigation tasks, while achieving comparable results in RoboKitchen. The success rate is reported with the mean and standard deviation across 3 different random seeds.
  • Figure 5: Evaluation trajectories visualization in 25-room for Dr. Strategy and LEXA. (Top) Ten evaluation trajectories per goal are visualized. All trajectories start from the top-left cell and head towards the desired goals positioned in the middle of each room. The red and blue lines indicate failed and successful trajectories, respectively. (Bottom) Trajectories (A) and (C) aim to reach Goal 1, while (B) and (D) aim to reach Goal 2. Dr. Strategy's trajectory (C) successfully reaches Goal 1 with precision due to focused sampling, unlike LEXA's trajectory (A). For Goal 2, trajectory (D) demonstrates the advantages of exploiting the highway policy by finding the goal's vicinity, a capability lacking in trajectory (B) with flat models.
  • ...and 15 more figures
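
The two landmark-selection rules in the Figure 2 caption (the C-Landmark during exploration and the G-Landmark at test time) can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: the function names are hypothetical, the exploration-potential scores are passed in as plain numbers because the caption does not say how they are computed, "nearest" is assumed to mean Euclidean distance between embeddings, and the hand-off from the Highway policy to the second policy is assumed to happen after a fixed number of steps.

```python
import math

def pick_curious_landmark(potentials):
    """C-Landmark: index of the landmark with the highest exploration potential.

    `potentials` holds one made-up score per landmark; how the score is
    estimated is outside the scope of this sketch.
    """
    return max(range(len(potentials)), key=potentials.__getitem__)

def pick_goal_landmark(goal_embedding, landmarks):
    """G-Landmark: index of the landmark nearest to the goal embedding.

    Euclidean distance is an assumption made for illustration.
    """
    return min(
        range(len(landmarks)),
        key=lambda i: math.dist(goal_embedding, landmarks[i]),
    )

def two_phase_schedule(highway_steps, total_steps, second_policy):
    """Which policy acts at each step: Highway first, then Explorer or Achiever.

    A fixed step budget for the Highway phase is an assumption of this sketch.
    """
    return ["highway"] * highway_steps + [second_policy] * (total_steps - highway_steps)

landmarks = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0)]  # toy landmark embeddings

# Exploration: travel to the most "curious" landmark, then explore.
c_landmark = pick_curious_landmark([0.1, 0.7, 0.3])
explore_plan = two_phase_schedule(2, 5, "explorer")

# Test time: travel to the landmark nearest the goal, then achieve the goal.
g_landmark = pick_goal_landmark((4.0, 4.5), landmarks)
test_plan = two_phase_schedule(2, 4, "achiever")
```

In both phases, only the landmarks are scored before acting, rather than every state. The caption attributes this to the exploration phase explicitly ("we only evaluate the landmarks").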