IDAT: A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents

Shrestha Mohanty; Negar Arabzadeh; Andrea Tupini; Yuxuan Sun; Alexey Skrynnik; Artem Zholus; Marc-Alexandre Côté; Julia Kiseleva

TL;DR

IDAT introduces a multi-modal dataset and toolkit for building interactive task-solving agents in a Minecraft-like environment, addressing data scarcity and evaluation shortcomings in interactive grounded language understanding. It combines the Seed and IGLU datasets, which together total around $8{,}947$ utterances and $1{,}182$ clarifying questions, with a scalable data collection tool and a human-in-the-loop evaluation platform (Greenlands) that enables qualitative multi-turn assessment. Offline and online evaluations expose limitations in current agents and show that metrics such as $F_1$ can fail to capture interaction quality, underscoring the value of human-in-the-loop feedback for driving progress. The authors release these resources openly under the MIT license to accelerate research on interactive, instruction-following agents, and suggest future work that integrates large multi-modal models to close the gap to human-like dialogue capabilities.

Abstract

Seamless interaction between AI agents and humans using natural language remains a key goal in AI research. This paper addresses the challenges of developing interactive agents capable of understanding and executing grounded natural language instructions through the IGLU competition at NeurIPS. Despite advancements, challenges persist, notably a scarcity of appropriate datasets and the need for effective evaluation platforms. We introduce a scalable data collection tool for gathering interactive grounded language instructions within a Minecraft-like environment, resulting in a multi-modal dataset with around 9,000 utterances and over 1,000 clarification questions. Additionally, we present a human-in-the-loop interactive evaluation platform for qualitative analysis and comparison of agent performance through multi-turn communication with human annotators. We offer these assets to the community as IDAT (IGLU Dataset And Toolkit); they aim to advance the development of intelligent, interactive AI agents and provide essential resources for further research.

Paper Structure (26 sections, 9 figures, 6 tables, 1 algorithm)

Introduction
Interactive Grounded Language Understanding (IGLU) Setup
Data Collection Tool
IDAT Dataset
Seed Dataset
IGLU-Dataset
IGLU Evaluation
Offline Evaluation
Human-in-the-Loop Interactive Online Evaluation: Greenlands Platform
Human Evaluation Results and Discussion
Related Work
Conclusion
Limitations
Comparison between related platforms
Data Collection Tool
...and 11 more sections

Figures (9)

Figure 1: Interactive Grounded Language Understanding (IGLU) Setup
Figure 2: (a) The architecture of the data collection tool. (b) The IGLU dataset collection pipeline.
Figure 3: Example of seed data collection, where the Architect can see the goal structure and provides instructions for the Builder. The blue arrows indicate turns for the first goal structure, the orange arrows indicate turns for the second goal structure. Annotators can switch roles between architect and builder for different structures.
Figure 4: MTurk view of the data collection tool.
Figure 5: The human participant is spawned in the Lobby world when they join the server. It is a flat world where the only action they are allowed to take is pasting a Join Code into the chat box.
...and 4 more figures
