Auxiliary Reward Generation with Transition Distance Representation Learning

Siyuan Li, Shijie Han, Yingnan Zhao, By Liang, Peng Liu

TL;DR

This work tackles the challenge of designing effective rewards for reinforcement learning by learning a transition-distance representation that captures how many state transitions separate two states. It introduces Transition Distance Representation Learning (TDRP), trained with a contrastive objective so that latent distances reflect trajectory progress, and uses these embeddings to generate dense auxiliary rewards for both single-task goals and long-horizon skill chaining. Empirical results on Robosuite and Factory manipulation tasks show that TDRP-based auxiliary rewards improve learning efficiency, stabilize convergence, and outperform several state-of-the-art representation learning and reward-shaping baselines. The approach is simple and data-efficient in the tested settings, and it holds promise for broader use in robotics; future work includes scaling to very high-dimensional observations and integrating with hierarchical or offline RL frameworks.
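As a rough illustration of how a latent distance could be turned into a dense auxiliary reward, the sketch below uses the negative Euclidean distance between the embeddings of the current state and a goal state. This is a minimal sketch under stated assumptions, not the paper's implementation: the linear `encode` function is a hypothetical stand-in for a learned TDRP encoder, and the names `auxiliary_reward`, `shaped_reward`, `scale`, and `aux_weight` are invented for this example.

```python
import numpy as np


def encode(state, weights):
    """Hypothetical stand-in for a learned encoder: a fixed linear map."""
    return weights @ state


def auxiliary_reward(state, goal, weights, scale=1.0):
    """Negative latent distance to the goal (assumed form, not from the paper).

    The value rises toward zero as the state's embedding approaches
    the goal's embedding.
    """
    z_state = encode(state, weights)
    z_goal = encode(goal, weights)
    return -scale * float(np.linalg.norm(z_state - z_goal))


def shaped_reward(env_reward, state, goal, weights, aux_weight=0.1):
    """Environment reward plus a weighted auxiliary term (assumed combination)."""
    return env_reward + aux_weight * auxiliary_reward(state, goal, weights)


# Toy usage with random encoder weights (illustrative data only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 6))
goal = np.ones(6)
far_state = np.zeros(6)
near_state = 0.9 * np.ones(6)

r_far = auxiliary_reward(far_state, goal, weights)
r_near = auxiliary_reward(near_state, goal, weights)
```

In this toy setup `r_near` is larger than `r_far`, so an agent receives a denser learning signal than a sparse success reward would give. With a learned encoder, the latent distance would be intended to track the number of transitions to the goal rather than raw-state proximity.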

Abstract

Reinforcement learning (RL) has shown its strength in challenging sequential decision-making problems. The reward function in RL is crucial to learning performance, as it serves as a measure of the degree of task completion. In real-world problems, rewards are predominantly human-designed, which requires laborious tuning and is easily affected by human cognitive biases. To achieve automatic auxiliary reward generation, we propose a novel representation learning approach that can measure the "transition distance" between states. Building upon these representations, we introduce an auxiliary reward generation technique for both single-task and skill-chaining scenarios without the need for human knowledge. The proposed approach is evaluated in a wide range of manipulation tasks. The experimental results demonstrate the effectiveness of measuring the transition distance between states and the improvement induced by auxiliary rewards, which not only promote better learning efficiency but also increase convergence stability.

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Euclidean distance in the raw state space cannot measure the difficulty of achieving state transitions. (a) The Place-nut task. (b) A successful trajectory in the x-z space, where a square represents a state, and the number denotes the timestep index of the state.
  • Figure 2: The proposed learning framework, where the TDRP model and the policy are simultaneously learned.
  • Figure 3: Robot manipulation tasks used in the experiment section. Subfigures (a)(b)(c)(d) illustrate the tasks in the Robosuite benchmark. Subfigures (e)(f)(g) illustrate the tasks in the Factory benchmark.
  • Figure 4: Experiment results in the Robosuite tasks. Videos of the learned policies are provided at https://sites.google.com/view/transition-distance-rp/tdrp.
  • Figure 5: Experiment results in the Factory tasks. (a) Pick-nut task. (b) Place-nut-on-bolt task. (c) Screw-nut task.
  • ...and 1 more figure