LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein

TL;DR

LMAct confronts the challenge of in-context imitation learning under extremely long multimodal contexts by evaluating frontier LMs across six interactive tasks with up to 512 demonstrations in contexts up to $1\times 10^6$ tokens. The study systematically analyzes observation formats (text vs images), prompting strategies (including chain-of-thought), and the stability of in-context learning, while providing an open-source benchmark for zero-, few-, and many-shot evaluation. Across tasks, models frequently fail to reach expert performance, with some tasks showing only modest gains from more demonstrations and others showing no improvement, highlighting a persistent knowledge-doing gap in long-context multimodal decision-making. The LMAct benchmark offers a controlled, extensible platform to diagnose these limitations and guide future research toward truly capable autonomous agents.

Abstract

In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.

Paper Structure

This paper contains 44 sections, 30 figures, 16 tables.

Figures (30)

  • Figure 1: LMAct overview. Our multimodal benchmark consists of six decision-making tasks that come with an expert policy and potentially multiple state representations. For evaluation, LM performance is measured on test episodes with unseen initial states. LMs are conditioned on a generic decision-making preamble (fixed across all tasks), followed by $0$ to $N$ demonstration episodes and a separator that indicates the start of the current episode. Depending on the task, $N$ can be up to $512$, with up to $100$ steps per episode; the maximum context length is $1$M tokens. In each step of the test episode, an action is generated from the LM's predicted continuation of the context. The resulting environment interaction produces the next observation, which is added to the growing context of state-action pairs.
  • Figure 2: Best scores per model and task across all observation formats, numbers of demonstration episodes, and ablations (chain-of-thought, showing legal actions). Accordingly, different bars in a panel may be based on different settings. The expert policy (which produced the demonstrations) is an upper baseline. The lower baseline randomly selects a legal action at each step. Claude 3.5 Sonnet, o1-mini, and o1-preview cannot be evaluated on Atari -- Phoenix: o1-mini and o1-preview cannot process images, and Claude 3.5 Sonnet cannot process enough of them.
  • Figure 3: In-context imitation learning on Atari -- Phoenix (RGB observations). Almost all models benefit (mildly) from one demonstration episode, but not from more (GPT-4o and o1 cannot fit multiple demonstration episodes in the context). While no model outperforms the random baseline, Gemini 1.5 Flash performs best.
  • Figure 4: In-context imitation learning on chess against the weakest variant of Stockfish (level $0$, $\approx 1300$ Elo), further restricted to one node. The models almost always lose (i.e., score $-1$) and do not benefit from more demonstrations. The PGN observations enable the best results, in particular for GPT-4o, which performs best but still loses the majority of games against this weak opponent.
  • Figure 5: In-context imitation learning on $7\times7$ crossword puzzles (using clues with the simplest rating) with ASCII observations. The performance of most models is largely unaffected by the number of expert demonstration episodes. o1-preview and o1 solve most crosswords, while other models struggle to varying degrees.
  • ...and 25 more figures
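
The evaluation protocol described in the Figure 1 caption (a fixed preamble, then $0$ to $N$ demonstration episodes, then a separator, then a test episode whose observation-action pairs are appended to the context step by step) can be illustrated with a minimal sketch. This is not the benchmark's actual code: the prompt strings, the `CountdownEnv` toy environment, and the `policy(context, legal_actions)` interface are all hypothetical stand-ins, and a real run would replace `policy` with a call to an LM. The random and expert policies mirror the lower and upper baselines mentioned in the Figure 2 caption.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical prompt pieces; the benchmark's real wording is not given here.
PREAMBLE = "You are an agent. Observe the state and output one action."
SEPARATOR = "=== current episode ==="

Episode = List[Tuple[str, str]]            # (observation, action) pairs
Policy = Callable[[str, List[str]], str]   # (context, legal actions) -> action


def build_context(demos: List[Episode]) -> str:
    """Preamble, then each demonstration episode, then the separator."""
    parts = [PREAMBLE]
    for i, episode in enumerate(demos):
        parts.append(f"Demonstration episode {i}:")
        for obs, act in episode:
            parts.append(f"Observation: {obs}")
            parts.append(f"Action: {act}")
    parts.append(SEPARATOR)
    return "\n".join(parts)


class CountdownEnv:
    """Toy environment (assumption): reach 0 by choosing 'dec'."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.state = start

    def reset(self) -> str:
        self.state = self.start
        return str(self.state)

    def legal_actions(self) -> List[str]:
        return ["dec", "noop"]

    def step(self, action: str) -> Tuple[str, float, bool]:
        if action == "dec":
            self.state -= 1
        done = self.state == 0
        return str(self.state), (1.0 if done else 0.0), done


def run_episode(env: CountdownEnv, policy: Policy,
                demos: List[Episode], max_steps: int = 100) -> float:
    """Roll out one test episode, growing the context at every step."""
    context = build_context(demos)
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        context += f"\nObservation: {obs}\nAction:"
        action = policy(context, env.legal_actions())  # an LM call in practice
        context += f" {action}"
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def random_policy(context: str, legal: List[str]) -> str:
    """Lower baseline: pick a legal action uniformly at random."""
    return random.choice(legal)


def expert_policy(context: str, legal: List[str]) -> str:
    """Upper baseline for the toy environment: always decrement."""
    return "dec"
```

For example, `run_episode(CountdownEnv(3), expert_policy, demos=[])` corresponds to the zero-shot setting, while passing a list of recorded expert episodes as `demos` corresponds to the few- and many-shot settings. Keeping the policy interface as a plain function of the context string is only one possible design; it makes it easy to swap the baselines for a model call without touching the rollout loop.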