AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification

Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim

TL;DR

AMPED addresses the challenge of balancing exploration and skill diversity in skill-based RL (SBRL) by unifying entropy-based exploration with contrastive diversity objectives. It uses gradient surgery to mitigate conflicting updates between the two objectives and a SAC-based skill selector to adapt pretrained skills to downstream tasks. A theoretical analysis links skill diversity to reduced fine-tuning sample complexity, and experiments on Tree Maze and URLB show performance that surpasses SBRL baselines, with ablations supporting the contribution of each component. The work advances principled, generalizable SBRL by explicitly harmonizing exploration and diversity through gradient-aware optimization.

Abstract

Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often struggle to optimize these two conflicting objectives simultaneously. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element of AMPED contributes to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/

Paper Structure

This paper contains 49 sections, 2 theorems, 18 equations, 32 figures, 15 tables, 3 algorithms.

Key Result

Theorem 1

Define $\delta = \min_{i\neq j} d\bigl(\rho_{z_i},\rho_{z_j}\bigr)$ and $\varepsilon = d\bigl(\rho,\rho_{z_\star}\bigr)$. Assume that the skills are sufficiently diversified, so that $\Delta \equiv \delta - 2\varepsilon > 0$. Draw $n$ i.i.d. trajectories $S^{(1)},\dots,S^{(n)}$ from the target policy. For a confidence level $\eta\in (0,1)$, if $n$ is sufficiently large, then $\Pr [\widehat{z}\neq z_\star] \le \eta$.
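
To make the quantities in Theorem 1 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each $\rho$ is a discrete state-visitation distribution, that $d$ is the total variation distance, and that the greedy selector $\widehat{z}$ picks the skill whose distribution is nearest to the empirical distribution of the drawn samples. None of these choices is specified in the excerpt above, and the function names `tv_distance`, `empirical_dist`, `diversity_margin`, and `greedy_select` are hypothetical.

```python
from itertools import combinations


def tv_distance(p, q):
    """Total variation distance between two discrete distributions (assumed choice of d)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def empirical_dist(samples, n_states):
    """Empirical distribution of state indices; a stand-in for the n drawn trajectories."""
    counts = [0] * n_states
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]


def diversity_margin(skill_dists, target, star):
    """Return (delta, epsilon, Delta) with Delta = delta - 2 * epsilon."""
    delta = min(tv_distance(p, q) for p, q in combinations(skill_dists, 2))
    eps = tv_distance(target, skill_dists[star])
    return delta, eps, delta - 2 * eps


def greedy_select(skill_dists, empirical):
    """Index of the skill whose distribution is closest to the empirical one."""
    return min(range(len(skill_dists)),
               key=lambda i: tv_distance(empirical, skill_dists[i]))


# Three maximally separated skills over three states (illustrative data only).
skills = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = empirical_dist([0, 0, 1, 0], n_states=3)   # [0.75, 0.25, 0.0]
print(diversity_margin(skills, target, star=0))     # (1.0, 0.25, 0.5)
print(greedy_select(skills, target))                # 0
```

Under the assumption that $d$ is a metric, the condition $\Delta > 0$ explains why the nearest skill is the right one: for any $j \neq \star$, the triangle inequality gives $d(\rho,\rho_{z_j}) \ge d(\rho_{z_\star},\rho_{z_j}) - d(\rho,\rho_{z_\star}) \ge \delta - \varepsilon > \varepsilon$, so $z_\star$ is strictly closest to the target distribution, and a larger margin leaves more room for sampling error.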

Figures (32)

  • Figure 1: Graphical scheme of our method, AMPED. (a) At initialization, the skills cover small regions that lie close to each other in the task space. (b) During skill pretraining, the exploration and diversity objectives encourage the skills to widen their coverage and repel each other's regions. (c) In fine-tuning, the skill selector identifies the skill best aligned with the target task at each step. (d) The selected skill is further adapted via extrinsic rewards to maximize performance on the target task.
  • Figure 2: Overview of the training process of AMPED. During the skill pretraining phase, the agent is conditioned on randomly sampled skills and optimized using intrinsic rewards for exploration and diversity. The resulting gradients are not applied directly but are first balanced via a gradient surgery mechanism. In the fine-tuning phase, a skill selector adaptively selects a skill at each step based on task-specific feedback, and the agent is further optimized using extrinsic rewards from the downstream target task.
  • Figure 2: Exploration trajectories in the Square Maze with six skills. CeSD yields more contiguous coverage, while BeCL enforces stronger separation, leaving noticeable gaps.
  • Figure 3: Graphical illustration of gradient surgery. When the diversity gradient (red) and the exploration gradient (blue) conflict, one of them, chosen at random, is projected onto the orthogonal complement of the other to balance the updates. The summed gradient (purple) is used for the parameter update.
  • Figure 6: Agents exploring the Tree Maze after pretraining with different skill discovery objectives. Each of the agents in (a) to (f) is trained with six skills. Visually, our approach exhibits the most distinct skills while ensuring full coverage of the state space.
  • ...and 27 more figures
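
The gradient surgery step described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: gradients are plain Python lists, a "conflict" is taken to mean a negative dot product (the convention used in PCGrad-style methods, which the caption does not spell out), and the gradient to project is chosen with probability 0.5. The names `dot`, `project_out`, and `combine_gradients` are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(g, ref):
    """Project g onto the orthogonal complement of ref (remove g's component along ref)."""
    denom = dot(ref, ref)
    if denom == 0.0:
        return list(g)  # ref is the zero vector, so there is nothing to remove
    scale = dot(g, ref) / denom
    return [a - scale * b for a, b in zip(g, ref)]


def combine_gradients(g_div, g_exp, rng=random):
    """Sum the diversity and exploration gradients, applying surgery on conflict."""
    if dot(g_div, g_exp) < 0:      # assumed conflict test: negative inner product
        if rng.random() < 0.5:     # assumed: each gradient is equally likely to be projected
            g_div = project_out(g_div, g_exp)
        else:
            g_exp = project_out(g_exp, g_div)
    return [a + b for a, b in zip(g_div, g_exp)]


# Non-conflicting gradients are simply added.
print(combine_gradients([1.0, 0.0], [0.0, 1.0]))   # [1.0, 1.0]

# Conflicting gradients: the result depends on which one is projected.
print(combine_gradients([1.0, 0.0], [-1.0, 1.0]))
```

In this sketch, projecting removes only the component of one gradient that opposes the other, so the summed update never has a negative inner product with the gradient that was left unprojected.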

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof