ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation

Xuerui Wang, Guangyu Ren, Tianhong Dai, Bintao Hu, Shuangyao Huang, Wenzhang Zhang, Hengyan Liu

TL;DR

ACDC is a learning paradigm that integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. Experiments show that it consistently outperforms state-of-the-art baselines in both sample efficiency and final task success rate.

Abstract

Goal-conditioned reinforcement learning has shown considerable potential in robotic manipulation; however, existing approaches remain limited by their reliance on prioritizing collected experience, resulting in suboptimal performance across diverse tasks. Inspired by human learning behaviors, we propose a more comprehensive learning paradigm, ACDC, which integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. More specifically, at the planning level, the AC component schedules the learning curriculum by dynamically balancing diversity-driven exploration and quality-driven exploitation based on the agent's success rate and training progress. At the control level, the DC component implements the curriculum plan through norm-constrained contrastive learning, enabling magnitude-guided experience selection aligned with the current curriculum focus. Extensive experiments on challenging robotic manipulation tasks demonstrate that ACDC consistently outperforms the state-of-the-art baselines in both sample efficiency and final task success rate.
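The magnitude-guided experience selection described at the control level can be illustrated with a minimal sketch. It assumes that "magnitude" means the L2 norm of a trajectory encoding and that trajectories are sampled with probability proportional to that norm; the abstract does not state the sampling rule, and the names `l2_norm` and `select_by_norm` are hypothetical, so this is an illustration rather than the authors' implementation.

```python
import math
import random


def l2_norm(vec):
    """Euclidean (L2) norm of an encoding vector."""
    return math.sqrt(sum(x * x for x in vec))


def select_by_norm(encodings, batch_size, rng=None):
    """Sample trajectory indices, favouring encodings with larger norms.

    Assumption: sampling probability is proportional to the L2 norm.
    The source only says that norms guide experience selection.
    """
    rng = rng or random.Random()
    norms = [l2_norm(z) for z in encodings]
    if sum(norms) == 0:
        # All encodings are zero vectors: fall back to uniform sampling.
        return [rng.randrange(len(encodings)) for _ in range(batch_size)]
    return rng.choices(range(len(encodings)), weights=norms, k=batch_size)


# Usage: three hypothetical trajectory encodings from a trained encoder.
encodings = [[0.1, 0.0], [3.0, 4.0], [1.0, 1.0]]
batch = select_by_norm(encodings, batch_size=4, rng=random.Random(0))
```

Under this assumption, a trajectory whose encoding has a larger norm is replayed more often, which is one simple way to turn a learned magnitude into a replay priority.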

Paper Structure

This paper contains 38 sections, 18 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overview of our ACDC framework. Planning Level: Adaptive Curriculum (AC) Planning establishes an adaptive curriculum by combining diversity and quality metrics through an adaptive weighting mechanism to generate trajectory priority scores $F(\tau)$. Control Level: Dynamic Contrastive (DC) Control leverages these scores through a ranking and pairing module to construct positive/negative pairs, then employs two-phase contrastive learning for experience selection. Phase I trains an encoder network to learn trajectory representations from constructed pairs, while Phase II computes L2-norm scores through agent-encoder interaction to dynamically guide experience selection.
  • Figure 2: Architecture of Adaptive Curriculum (AC) Planning. The framework evaluates trajectories through two complementary metrics: diversity scores and quality scores, which are normalized to $\tilde{d}_\tau$ and $\tilde{q}_\tau$ respectively. The weighting function generates adaptive parameter $\lambda(s_r,t)$ based on current success rate and learning progress. These components are combined through the adaptive weighting mechanism to produce trajectory scores: $F(\tau) = \tilde{d}_\tau + \lambda(s_r,t)\tilde{q}_\tau$, where the value of $\lambda(s_r,t)$ dynamically determines whether the agent operates in exploration (low $\lambda(s_r,t)$), balanced (medium $\lambda(s_r,t)$), or exploitation (high $\lambda(s_r,t)$) planning. Note: "$\Longrightarrow$" denotes normalization; achieved goals $g^{ac}$ refer to task-relevant components extracted from the observation space.
  • Figure 3: Quality score measures how close a trajectory's final achieved goal is to the desired goal. The score uses a Gaussian function to transform Euclidean distance into a value between 0 and 1, providing smooth differentiation between successful and unsuccessful trajectories.
  • Figure 4: Architecture of Dynamic Contrastive (DC) Control. This level operates in two phases. Phase I (Encoder Network Training) ranks trajectories using AC's combined scores $F(\tau)$ and selects positive and negative pairs. An LSTM-based sequence encoder extracts temporal features from achieved goals along the trajectory and fuses them with the adaptive weighting parameter $\lambda(s_r,t)$ through a fusion layer to create joint encodings $Z_{(\tau, \lambda(s_r,t))}$. Contrastive learning with a combined loss function then separates positive and negative examples while ensuring positive encodings exhibit larger norms than negative ones. Phase II (Experience Selection) uses the trained encoder network to process all trajectories $\tau$ in the current replay buffer $\mathcal{B}$ together with $\lambda(s_r,t)$, computing the L2 norms $\|Z_{(\tau, \lambda(s_r,t))}\|_2$ as trajectory scores for dynamic experience selection that adapts to the agent's evolving capabilities.
  • Figure 5: The trained encoder network establishes a two-tier representation architecture. Positive trajectories occupy the premium region (green) while negative trajectories are confined to the inferior region (red), with point size indicating L2 norm magnitude. This structure enables both categorical separation and magnitude-based importance ranking.
  • ...and 8 more figures
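
The planning-level score from the Figure 2 and Figure 3 captions, $F(\tau) = \tilde{d}_\tau + \lambda(s_r,t)\tilde{q}_\tau$, can be sketched as follows. Several details are assumptions not given in the captions: the Gaussian is taken as $\exp(-\mathrm{dist}^2 / (2\sigma^2))$ with a hypothetical width `sigma`, normalization is min-max, and $\lambda$ is passed in as a plain number because the form of $\lambda(s_r,t)$ is not stated here. The function names are likewise hypothetical.

```python
import math


def quality_score(achieved_goal, desired_goal, sigma=0.05):
    """Gaussian of the Euclidean distance between final achieved and desired goal.

    Returns a value in (0, 1]; 1.0 means the goal was reached exactly.
    The Gaussian width `sigma` is an assumed parameter.
    """
    dist = math.dist(achieved_goal, desired_goal)
    return math.exp(-dist ** 2 / (2 * sigma ** 2))


def min_max_normalize(values):
    """Rescale values to [0, 1]; an assumed choice of normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_scores(diversity, quality, lam):
    """F(tau) = d_tilde + lam * q_tilde for each trajectory."""
    d_tilde = min_max_normalize(diversity)
    q_tilde = min_max_normalize(quality)
    return [d + lam * q for d, q in zip(d_tilde, q_tilde)]


# Usage: three trajectories with made-up diversity and quality scores.
diversity = [1.0, 2.0, 3.0]
quality = [0.2, 0.9, 0.5]
explore = priority_scores(diversity, quality, lam=0.0)  # diversity only
exploit = priority_scores(diversity, quality, lam=2.0)  # quality dominates
```

With a low $\lambda$ the most diverse trajectory ranks first, while a high $\lambda$ promotes the trajectory that ended closest to its goal, matching the exploration and exploitation regimes described in the Figure 2 caption.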