Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

Quentin Gallouédec, Edward Beeching, Clément Romac, Emmanuel Dellandréa

TL;DR

Jack of All Trades (JAT) is a single transformer model designed to operate across NLP, CV, and RL with one set of weights. For sequential decision-making, it encodes observations (with rewards) and actions as per-timestep embeddings and interleaves them, which widens the attention window as measured in timesteps, and it adds observation prediction as an auxiliary training objective. Trained on a diverse, open dataset and evaluated on Atari, BabyAI, MuJoCo, and Meta-World, JAT achieves competitive results while being substantially smaller and cheaper than the Gato baseline, and the auxiliary observation-prediction objective improves learning. The work demonstrates the feasibility of cross-domain generalist agents and provides open resources to spur further research, including on handling task indeterminacy and improving imitation learning.
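The interleaving described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `interleave_timesteps` and `causal_mask`, the list-of-floats embedding type, and the exact ordering (observation embedding first, then action embedding, per timestep) are assumptions made for illustration only.

```python
from typing import List, Sequence

Embedding = List[float]  # hypothetical stand-in for a learned embedding vector


def interleave_timesteps(
    obs_emb: Sequence[Embedding], act_emb: Sequence[Embedding]
) -> List[Embedding]:
    """Build the sequence [o_0, a_0, o_1, a_1, ...] from per-timestep embeddings.

    obs_emb[t] is assumed to encode the observation (and reward) at step t,
    act_emb[t] the action at step t. Each timestep contributes two positions.
    """
    if len(obs_emb) != len(act_emb):
        raise ValueError("need one observation and one action embedding per timestep")
    sequence: List[Embedding] = []
    for obs, act in zip(obs_emb, act_emb):
        sequence.append(obs)
        sequence.append(act)
    return sequence


def causal_mask(n: int) -> List[List[bool]]:
    """Return an n x n mask; entry [i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy example: 3 timesteps, 2-dimensional embeddings (values are arbitrary).
observations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence = interleave_timesteps(observations, actions)
mask = causal_mask(len(sequence))
```

Under this layout, a context of `2 * T` positions covers `T` full timesteps, and the causal mask ensures that the prediction at any position depends only on earlier positions.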

Abstract

The search for a general model that can operate seamlessly across multiple domains remains a key goal in machine learning research. The prevailing methodology in Reinforcement Learning (RL) typically limits models to a single task within a unimodal framework, a limitation that contrasts with the broader vision of a versatile, multi-domain model. In this paper, we present Jack of All Trades (JAT), a transformer-based model with a unique design optimized for handling sequential decision-making tasks and multi-modal data types. The JAT model demonstrates its robust capabilities and versatility by achieving strong performance on very different RL benchmarks, along with promising results on Computer Vision (CV) and Natural Language Processing (NLP) tasks, all using a single set of weights. The JAT model marks a significant step towards more general, cross-domain AI model design, and notably, it is the first model of its kind to be fully open-sourced at https://huggingface.co/jat-project/jat, including a pioneering general-purpose dataset.

Paper Structure

This paper contains 35 sections, 1 equation, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Architecture of the JAT network. For sequential decision-making tasks, observations and rewards on the one hand, and actions on the other, are encoded and interleaved. The model generates the next embedding autoregressively with a causal mask, and decodes according to expected modality.
  • Figure 2: JAT image captioning examples. The theme is usually correct, although the relevance is sometimes limited.
  • Figure 3: JAT text completion examples. The syntax is generally correct, the completion is on-topic, although the generated text may be wrong.
  • Figure 4: Aggregated expert normalized scores with 95% Confidence Intervals (CIs) for each RL domain as a function of learning step.
  • Figure 5: Human normalized scores for the JAT agent on the Atari 57 benchmark.
  • ...and 12 more figures
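The captions for Figures 4 and 5 refer to expert-normalized and human-normalized scores without defining them here. A common convention, assumed in the sketch below, rescales a raw score so that a random agent maps to 0 and the reference agent (expert or human) maps to 1. The function name `normalized_score` and the example numbers are hypothetical, not values from the paper.

```python
def normalized_score(score: float, random_score: float, reference_score: float) -> float:
    """Rescale a raw score so that random play maps to 0 and the reference to 1.

    reference_score is the expert's score for an expert-normalized score,
    or the human score for a human-normalized score (assumed convention).
    """
    span = reference_score - random_score
    if span == 0:
        raise ValueError("reference and random scores must differ")
    return (score - random_score) / span


# Hypothetical numbers for illustration only.
halfway = normalized_score(score=50.0, random_score=0.0, reference_score=100.0)
at_reference = normalized_score(score=250.0, random_score=50.0, reference_score=250.0)
at_random = normalized_score(score=50.0, random_score=50.0, reference_score=250.0)
```

Normalizing this way makes scores comparable across tasks with very different reward scales, which is what allows them to be aggregated per domain.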