
MARPLE: A Benchmark for Long-Horizon Inference

Emily Jin, Zhuoyi Huang, Jan-Philipp Fränken, Weiyu Liu, Hannah Cha, Erik Brockbank, Sarah Wu, Ruohan Zhang, Jiajun Wu, Tobias Gerstenberg

TL;DR

MARPLE introduces a benchmark for long-horizon, multimodal inference in two-agent household scenarios, deployed in a Gridworld-like environment extended from Mini-BEHAVIOR. Framed as a POMDP, the task requires predicting which agent caused a query state using visual, language, and audio evidence across time steps up to $T$. The authors compare Monte Carlo search with learned agent models, GPT-4, and human baselines, providing datasets, evaluation metrics, and a public codebase. Key findings show humans outperform both AI baselines, while multimodal simulation models benefit from combining modalities; GPT-4 struggles to converge on several tasks, highlighting gaps in current LLM-based inference for long-horizon events. MARPLE thus serves as a rigorous platform to study high-level, multimodal reasoning and to drive development of more robust, human-like inference systems.

Abstract

Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multimodal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models are less robust and performant, while GPT-4 has difficulty comprehending environmental changes. We analyze what factors influence inference performance and ablate different modes of evidence, finding that all modes are valuable for performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models.

Paper Structure (37 sections, 1 equation, 14 figures, 5 tables, 1 algorithm)


Figures (14)

  • Figure 1: Illustrative example of an inference task in MARPLE, a "whodunit"-inspired benchmark for long-horizon inference. Given a query state change, the challenge is to decide which agent caused the change by leveraging visual, text, and/or audio evidence of both agents A and B up to some timestep $t$. Inference accuracy, defined as the probability of choosing the correct agent, is calculated at every timestep and used to evaluate performance.
  • Figure 2: MARPLE Household Simulator (backend). The simulator contains a list of pre-defined Missions. Each mission consists of a list of Subgoals, and each subgoal represents an Action-State_change-Object-Furniture-Room combination. Given the mission definition and the corresponding environment configuration file, we can procedurally generate the environment.
  • Figure 3: A hierarchical planner for procedural generation of agent behaviors. A high-level planner samples a mission, a finite state machine breaks it into subgoals, and a low-level planner determines an action sequence.
  • Figure 4: Performance for each baseline across scenarios. Results are included for the simulation baseline trained both in-distribution and out-of-distribution (ood). Inference scenarios are presented in order of increasing difficulty from left to right, top to bottom. Error bands correspond to 95% confidence intervals across tested trajectories.
  • Figure 5: Example rollouts performed by our simulation model, starting from the initial state to possible future states. For agent A, this rollout reaches the inference state: Pickup(plant).
  • ...and 9 more figures
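
The captions for Figures 1 and 5 describe rollout-based inference: simulate each agent forward, check whether a rollout reaches the query state, and report the probability of choosing the correct agent. The following is a minimal sketch of that idea on a toy problem, not the authors' implementation. Everything in it is an assumption made for illustration: the one-dimensional integer state, the hypothetical `make_policy` helper that stands in for a learned agent model, and the choice to normalize the two reach probabilities into a belief.

```python
import random


def make_policy(step_prob):
    """Hypothetical stand-in for a learned agent model.

    The toy agent lives on a line of integer states and advances by one
    with probability `step_prob`; otherwise it stays where it is.
    """
    def policy(state, rng):
        return state + 1 if rng.random() < step_prob else state
    return policy


def rollout_reaches_query(policy, start_state, query_state, horizon, rng):
    """Simulate one rollout and report whether it hits the query state."""
    state = start_state
    for _ in range(horizon):
        if state == query_state:
            return True
        state = policy(state, rng)
    return state == query_state


def estimate_reach_probability(policy, start_state, query_state,
                               horizon, n_rollouts, rng):
    """Monte Carlo estimate: fraction of rollouts that reach the query state."""
    hits = sum(
        rollout_reaches_query(policy, start_state, query_state, horizon, rng)
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def normalized_belief(p_a, p_b):
    """Turn two reach probabilities into a belief that agent A is responsible.

    Assumption of this sketch: normalize the two estimates, and fall back
    to 0.5 when neither agent ever reaches the query state.
    """
    total = p_a + p_b
    return 0.5 if total == 0 else p_a / total


rng = random.Random(0)          # fixed seed so the sketch is reproducible
agent_a = make_policy(0.8)      # agent A tends to move toward the query state
agent_b = make_policy(0.2)      # agent B rarely does

p_a = estimate_reach_probability(agent_a, 0, 5, horizon=8, n_rollouts=2000, rng=rng)
p_b = estimate_reach_probability(agent_b, 0, 5, horizon=8, n_rollouts=2000, rng=rng)
belief_a = normalized_belief(p_a, p_b)
print(f"P(reach | A)={p_a:.3f}  P(reach | B)={p_b:.3f}  belief in A={belief_a:.3f}")
```

In this toy setup agent A is far more likely than agent B to reach the query state, so the belief in A ends up close to 1. Repeating the estimate from the state observed at each timestep would give a per-timestep curve of the kind Figure 4 describes.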