Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks

Rushiv Arora

TL;DR

This paper introduces Lexical Policy Networks (LEXPOL), a language-conditioned gating mechanism over a mixture of sub-policies for multi-task reinforcement learning. LEXPOL encodes task metadata with a pre-trained language model and learns a gating MLP that combines fundamental skills, training end-to-end so that a single policy solves diverse tasks. On MetaWorld MT10/MT50 it matches or exceeds strong baselines in success rate and sample efficiency. A key finding is that language-conditioned gates can compose independently trained expert policies to handle novel task descriptions and task combinations, which highlights the potential of natural-language metadata to index and recombine reusable skills. The work also presents a hybrid approach (LEXPOL+CARE) that jointly leverages state and policy context to further improve performance, pointing toward scalable, modular multi-task RL.

Abstract

Multi-task reinforcement learning often relies on task metadata -- such as brief natural-language descriptions -- to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.

Paper Structure

This paper contains 20 sections, 2 figures, 12 tables, 1 algorithm.

Figures (2)

  • Figure 1: LEXPOL Architecture. There are three primary components: 1) a Context Encoder, 2) a Mixture of Policies, and 3) a Gating MLP. Natural-language metadata is encoded into a context, while a mixture of policies, each representing a smaller skill, generates a set of candidate actions. The Gating MLP uses the encoded context to produce a soft attention over the policy outputs, resulting in a final action. Each policy represents a smaller skill that, when combined with other skills, can be used to solve multiple longer-horizon tasks.
  • Figure 2: A comparison of LEXPOL with pre-trained frozen policies (top image) and end-to-end trained policies (bottom image). Two goals are selected: go left (go to the blue goal) and go right (go to the red goal). In the top image, two separate policies are pre-trained, one per task, their parameters are frozen, and LEXPOL is then trained on a new task (go to the red goal, then the blue goal). In the bottom image, LEXPOL is first trained end-to-end and then trained on the new task with the policies frozen. In both cases, factorized skills enable multi-task reinforcement learning over longer horizons using natural-language context. Interestingly, the similarity between the two cases suggests that end-to-end training recovers the same two skills as the pre-trained policies.
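
The forward pass described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name GatedPolicyMixture, the linear sub-policies, the layer sizes, the softmax used to realize the soft attention, and the random vector standing in for the language-model embedding of the task description are all hypothetical choices made here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


class GatedPolicyMixture:
    """Hypothetical sketch: K sub-policies blended by a context-conditioned gate."""

    def __init__(self, context_dim, state_dim, action_dim,
                 num_policies, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: each sub-policy is a linear map from state to action.
        self.policy_weights = rng.normal(
            scale=0.1, size=(num_policies, action_dim, state_dim))
        # Gating MLP with one hidden layer: context -> logits over policies.
        self.w1 = rng.normal(scale=0.1, size=(hidden_dim, context_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.1, size=(num_policies, hidden_dim))
        self.b2 = np.zeros(num_policies)

    def gate(self, context):
        """Soft attention weights over sub-policies, computed from the context."""
        hidden = np.tanh(self.w1 @ context + self.b1)
        return softmax(self.w2 @ hidden + self.b2)

    def act(self, state, context):
        """Final action: attention-weighted blend of the sub-policy outputs."""
        proposals = self.policy_weights @ state   # shape (K, action_dim)
        weights = self.gate(context)              # shape (K,)
        return weights @ proposals                # shape (action_dim,)


# Placeholder for the encoded task description; a real system would obtain
# this vector from a pre-trained language model.
rng = np.random.default_rng(1)
context = rng.normal(size=16)
state = rng.normal(size=8)

model = GatedPolicyMixture(context_dim=16, state_dim=8,
                           action_dim=4, num_policies=3)
action = model.act(state, context)
```

Because the gate depends only on the encoded context in this sketch, freezing `policy_weights` and training only the gating parameters corresponds to the frozen-expert setting compared in Figure 2.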