LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein
TL;DR
LMAct confronts the challenge of in-context imitation learning under extremely long multimodal contexts by evaluating frontier LMs across six interactive tasks with up to 512 demonstrations in contexts up to $1\times 10^6$ tokens. The study systematically analyzes observation formats (text vs images), prompting strategies (including chain-of-thought), and the stability of in-context learning, while providing an open-source benchmark for zero-, few-, and many-shot evaluation. Across tasks, models frequently fail to reach expert performance, with some tasks showing only modest gains from more demonstrations and others showing no improvement, highlighting a persistent knowledge-doing gap in long-context multimodal decision-making. The LMAct benchmark offers a controlled, extensible platform to diagnose these limitations and guide future research toward truly capable autonomous agents.
Abstract
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
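To make the zero-, few-, and many-shot setup concrete, the following is a minimal sketch of how expert episodes might be placed in a model's context before it is asked for an action. It is not the benchmark's actual code: the names `demonstration_counts` and `build_prompt`, the text-only prompt layout, and the powers-of-two sweep are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# An episode is a list of (observation, action) pairs.
# Both are plain strings here (assumption: text-encoded observations).
Episode = List[Tuple[str, str]]


def demonstration_counts(max_demos: int = 512) -> List[int]:
    """Hypothetical sweep: zero demonstrations, then powers of two up to max_demos."""
    counts = [0]
    n = 1
    while n <= max_demos:
        counts.append(n)
        n *= 2
    return counts


def build_prompt(demos: List[Episode], current_observation: str) -> str:
    """Concatenate expert episodes, then ask for the next action (illustrative layout)."""
    parts = []
    for i, episode in enumerate(demos, start=1):
        parts.append(f"Demonstration {i}:")
        for observation, action in episode:
            parts.append(f"Observation: {observation}")
            parts.append(f"Action: {action}")
    parts.append("Current episode:")
    parts.append(f"Observation: {current_observation}")
    parts.append("Action:")
    return "\n".join(parts)
```

With an empty `demos` list this reduces to the zero-shot prompt; passing more episodes grows the context, which is how a sweep of this kind could approach the million-token regime.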
