Towards Adapting Reinforcement Learning Agents to New Tasks: Insights from Q-Values

Ashwin Ramaswamy, Ransalu Senanayake

TL;DR

The paper investigates how Deep Q-Networks (DQNs) can be adapted to related tasks by reusing and fine-tuning existing Q-value estimates rather than retraining from scratch. Using a controlled grid-world setup and a larger autonomous-vehicle intersection scenario, it compares retraining strategies based on on-policy learning, exploration, expert demonstrations, and supervised learning. The results show that starting from a base model whose Q-value estimates closely approximate the true Q-values yields faster and more robust adaptation to new task specifications, with random exploration further improving Q-value accuracy. The study provides practical guidelines for sample-efficient task adaptation in value-based RL, with implications for legacy systems and real-world robotics where rapid retuning is valuable.

Abstract

While contemporary reinforcement learning research and applications have embraced policy gradient methods as the panacea for solving learning problems, value-based methods can still be useful in many domains, provided we can work out how to exploit them in a sample-efficient way. In this paper, we explore the chaotic nature of DQNs in reinforcement learning and examine how the information they retain once trained can be repurposed to adapt a model to different tasks. We start by designing a simple experiment in which we can observe the Q-values for each state and action in an environment. We then train in eight different ways to explore how these training algorithms affect whether accurate Q-values are learned. We test the adaptability of each trained model when it is retrained to accomplish a slightly modified task. We then scale our setup to the larger problem of an autonomous vehicle at an unprotected intersection. We observe that the model adapts to new tasks more quickly when the base model's Q-value estimates are closer to the true Q-values. The results provide some insights and guidelines into which algorithms are useful for sample-efficient task adaptation.
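
The notion of estimates being "closer to the true Q-values" can be made concrete in a small tabular setting. The following is a minimal sketch, not the paper's implementation: the 3x3 grid, the terminal cells, the rewards, the discount factor, and the helper names (`step`, `true_q_values`, `q_error`) are all hypothetical choices made for illustration. It computes true Q-values by repeating the Bellman optimality backup and then measures how far an estimate lies from them.

```python
import numpy as np

# Hypothetical 3x3 grid world. Layout, rewards, and discount are
# illustrative assumptions, not the paper's settings.
N = 3
GOOD, POOR = (0, 2), (1, 2)                   # "good" and "poor" terminal cells
R_GOOD, R_POOR, R_STEP = 1.0, -1.0, -0.01
GAMMA = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Deterministic transition: move if in bounds, otherwise stay put."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    nxt = (nr, nc)
    if nxt == GOOD:
        return nxt, R_GOOD, True
    if nxt == POOR:
        return nxt, R_POOR, True
    return nxt, R_STEP, False


def true_q_values(tol=1e-10):
    """Q-value iteration: apply the Bellman optimality backup until it converges."""
    q = np.zeros((N, N, len(ACTIONS)))
    while True:
        new_q = np.zeros_like(q)
        for r in range(N):
            for c in range(N):
                if (r, c) in (GOOD, POOR):
                    continue  # terminal states keep Q = 0
                for a in range(len(ACTIONS)):
                    (nr, nc), reward, done = step((r, c), a)
                    new_q[r, c, a] = reward if done else reward + GAMMA * q[nr, nc].max()
        if np.abs(new_q - q).max() < tol:
            return new_q
        q = new_q


def q_error(q_est, q_true):
    """Mean absolute gap between estimated and true Q-values."""
    return float(np.abs(q_est - q_true).mean())


q_star = true_q_values()
greedy = q_star.argmax(axis=-1)  # index of the maximum-Q action in each state
# Stand-in for a learned estimate: the true values plus Gaussian noise.
noisy = q_star + np.random.default_rng(0).normal(0.0, 0.1, q_star.shape)
print(greedy)
print(q_error(q_star, q_star), q_error(noisy, q_star))
```

In a setting like this, the output of a trained network evaluated at every state could be passed as `q_est` in place of the noisy stand-in, which gives a single number for comparing candidate base models before retraining.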

Paper Structure (12 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Experiment 1: The Q-values for each state. The green and red squares represent the "good" and "poor" terminal states, respectively. Triangles that are highlighted in light blue represent the optimal action at each state, corresponding to the maximum Q-value at each state. The reward function used is described in the table above.
  • Figure 2: Experiment 2: The simulated environment. Cars are represented as circles. The ego vehicle is the red circle travelling horizontally to the left of the road. The ado vehicles are the other colored circles travelling vertically downwards on the perpendicular road. The reward function used is described in the table above.
  • Figure 3: The graph on the left plots the task accuracy as the model trains with supervised learning. The graph on the right plots the task accuracy as the model trains with deep Q-learning.
  • Figure 4: The graph on the left plots the accuracy of the model as it learns to adapt to the new task using the pre-trained supervised learning model as its base model. The graph on the right plots the accuracy of the model as it learns to adapt to the new task using the pre-trained DQN model as its base model.