AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao

Abstract

With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
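The interaction pattern described in the abstract (image plus action history plus a bare success/failure flag, with replanning at every step) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not AsgardBench's actual interface: `ToyKitchen`, `Observation`, `plan`, `run_episode`, and the action strings are all hypothetical names invented for illustration, and the "image" is a string label standing in for a rendered frame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    image: str                       # stand-in for a rendered frame (hypothetical)
    history: List[Tuple[str, bool]]  # (action, succeeded) pairs; no error text


class ToyKitchen:
    """Hypothetical one-object environment: fill a mug that may start dirty."""

    def __init__(self, mug_dirty: bool) -> None:
        self.mug_dirty = mug_dirty
        self.filled = False
        self.history: List[Tuple[str, bool]] = []

    def observe(self) -> Observation:
        image = "dirty_mug" if self.mug_dirty else "clean_mug"
        return Observation(image=image, history=list(self.history))

    def step(self, action: str) -> bool:
        # Only a success/failure bit is returned, never an explanation.
        if action == "clean mug":
            ok = self.mug_dirty
            if ok:
                self.mug_dirty = False
        elif action == "fill mug":
            ok = (not self.mug_dirty) and (not self.filled)
            if ok:
                self.filled = True
        else:
            ok = False
        self.history.append((action, ok))
        return ok


def plan(obs: Observation) -> List[str]:
    # The same instruction yields different plans depending on what is seen.
    if obs.image == "dirty_mug":
        return ["clean mug", "fill mug"]
    return ["fill mug"]


def run_episode(env: ToyKitchen, max_steps: int = 5) -> bool:
    for _ in range(max_steps):
        if env.filled:
            return True
        obs = env.observe()
        next_action = plan(obs)[0]  # replan each step, execute only the first action
        env.step(next_action)
    return env.filled
```

In this sketch a dirty mug produces the sequence "clean mug" then "fill mug", while a clean mug needs only "fill mug". A planner that ignored the observation and always emitted "fill mug" would fail in the dirty case and would learn only that the action did not succeed, which is the kind of situation the minimal-feedback setting is meant to expose.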

Paper Structure (33 sections, 19 figures, 6 tables)

Figures (19)

  • Figure 1: Agent observations and corresponding action plans in AsgardBench. Below each image is the action plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed action sequence.
  • Figure 2: Success rates for each model under image-based and Text-Only conditions. Visual input substantially improves performance for all but the weakest models, confirming that AsgardBench requires perception-conditioned reasoning: agents cannot rely on memorized action templates or detailed feedback. Text-Only performance remains low across models, in contrast to prior embodied benchmarks where Text-Only agents can perform competitively.
  • Figure 3: Relationship between task difficulty and the range of steps required for successful completion. Tasks that require longer or more variable action sequences tend to have lower success rates, indicating that models struggle with long-horizon dependencies and conditional branching. Each point represents one of the 108 tasks in AsgardBench.
  • Figure 4: Effect of feedback type on model performance. Removing success/failure signals (No Feedback) reduces accuracy, while providing detailed error messages (Detailed Feedback) sharply increases performance, including for Text-Only agents.
  • Figure 5: Model performance (blue) and percentage of undoable actions (orange). Higher-performing models produce fewer undoable actions, while weaker models generate a larger fraction of actions that cannot be executed. This alignment between success rate and undoable-action frequency suggests that difficulties in state tracking and plan adjustment contribute directly to overall task failure.
  • ...and 14 more figures