Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation

Jing-Cheng Pang, Xin-Yu Yang, Si-Hang Yang, Yang Yu

TL;DR

The paper introduces Inside-Out Learning (IOL) for natural language-conditioned reinforcement learning, replacing unbounded NL instructions with a task language (TL) expressed as object-predicate relations. It presents TALAR, a three-component system comprising a TL generator, an NL-to-TL translator (via a Variational Auto-Encoder), and an instruction-following policy trained with PPO, showing strong improvements and robustness to unseen NL expressions. TL serves not only to accelerate policy learning but also as a natural abstraction for hierarchical RL. The work demonstrates TL’s interpretability, enables better generalization, and provides a foundation for future dynamic dataset expansion and predicate-focused language representations.

Abstract

Natural language-conditioned reinforcement learning (RL) enables agents to follow human instructions. Previous approaches generally implement language-conditioned RL by providing human instructions in natural language (NL) and training an instruction-following policy. In this outside-in approach, the policy must comprehend the NL and manage the task simultaneously. However, unbounded NL examples often add considerable complexity to solving concrete RL tasks, which can distract the policy from learning to complete the task. To ease the learning burden of the policy, we investigate an inside-out scheme for natural language-conditioned RL by developing a task language (TL) that is task-related and unique. The TL is used in RL to achieve highly efficient and effective policy training. In addition, a translator is trained to translate NL into TL. We implement this scheme as TALAR (TAsk Language with predicAte Representation), which learns multiple predicates to model object relationships as the TL. Experiments indicate that TALAR not only comprehends NL instructions better but also yields a better instruction-following policy that improves the success rate by 13.4% and adapts to unseen expressions of NL instructions. The TL can also serve as an effective task abstraction that is naturally compatible with hierarchical RL.
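
To make the idea of a predicate-based task language concrete, here is a toy sketch in Python. It is not the authors' implementation: TALAR learns its predicates from data, whereas the predicates below (`in_front_of`, `left_of`), the 2D coordinates, and the `task_language` helper are all hypothetical and hand-written for illustration. The sketch only shows how a scene can be encoded as a vector of binary predicate values over ordered object pairs, and how a change in object relationships shows up as flipped bits.

```python
from itertools import permutations

# Hypothetical hand-written predicates (assumption: TALAR learns these instead).
def in_front_of(a, b):
    # Assumption for this toy scene: a smaller y coordinate means "in front".
    return a[1] < b[1]

def left_of(a, b):
    # Assumption for this toy scene: a smaller x coordinate means "left".
    return a[0] < b[0]

PREDICATES = (in_front_of, left_of)

def task_language(positions):
    """Encode a scene as one bit per (predicate, ordered object pair)."""
    names = sorted(positions)  # fixed object order keeps the encoding stable
    return tuple(
        int(pred(positions[a], positions[b]))
        for pred in PREDICATES
        for a, b in permutations(names, 2)
    )

# Toy scene: the cyan ball moves in front of the blue ball.
before = {"cyan": (0.0, 2.0), "blue": (1.0, 1.0)}
after = {"cyan": (0.0, 0.5), "blue": (1.0, 1.0)}

tl_before = task_language(before)
tl_after = task_language(after)

# Indices of the predicate values that flipped between the two scenes.
changed = [i for i, (x, y) in enumerate(zip(tl_before, tl_after)) if x != y]
print(tl_before, tl_after, changed)
```

In this toy encoding, any phrasing of "move the cyan ball in front of the blue ball" would correspond to the same pair of flipped bits, which is the sense in which a task language can be a unique, task-related representation of many NL expressions.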

Paper Structure (28 sections, 3 equations, 15 figures, 5 tables, 3 algorithms)

Figures (15)

  • Figure 1: An illustration of the OIL (outside-in learning) and IOL (inside-out learning) schemes in NLC-RL. Left: OIL directly exposes the NL instructions to the policy. Right: IOL develops a task language, which is a task-related and unique representation of NL instructions. The solid lines represent the instruction-following process, while the dashed lines represent TL development and translation.
  • Figure 2: Overall training process of task language development and translation. (a) The overall training process. (b) Network architecture of the TL generator. (c) Architecture of one predicate module. (d) Network architecture of the translator. The number of predicate modules, arguments and predicate networks can be adjusted according to the task scale.
  • Figure 3: A visualization of the CLEVR-Robot environment in our experiments. (a) At the beginning, one NL instruction is randomly sampled, e.g., "Can you move the cyan ball in front of the blue ball?" The agent then executes actions to complete the instruction. (b) The task terminates when the goal is achieved or the maximum timestep is reached.
  • Figure 4: The t-SNE representations of different types of NL encoding. Points with the same marker stand for the encodings of nine different NL expressions that describe the same human instruction. We add slight noise to the overlapping points for better presentation. (a) The t-SNE representations of the TL output by the translator. (b) The encoding output by the BERT model. (c) The encoding output by the language encoding layer of the OIL baseline (Bert-continuous in Section \ref{sec:exp_ifp}).
  • Figure 5: Frequency of five destination balls when a predicate network outputs a value of $1$. Each bar stands for the frequency of the ball with a certain colour.
  • ...and 10 more figures

Theorems & Definitions (1)

  • Definition 1: Task dataset