Your Co-Workers Matter: Evaluating Collaborative Capabilities of Language Models in Blocks World

Guande Wu, Chen Zhao, Claudio Silva, He He

TL;DR

The paper tackles how language-model–based agents can collaborate with humans or other agents in equal roles, a setting demanding intent understanding and coordination. It introduces CoBlock, a flexible two-agent blocks-world environment with independent, skill-dependent, and goal-dependent tasks, and employs a four-step prompting pipeline with chain-of-thought reasoning, partner-state modeling, and self-reflection to ground and coordinate behavior. Empirical results from both human-machine and machine-machine experiments show that grounding remains strong in single-agent tasks, while incorporating partner-state modeling and self-reflection yields about a 30 percentage-point increase in task completion and improves workload balance in collaborative scenarios. This work provides a practical testbed and prompting strategies that advance multi-agent collaboration with LLMs, highlighting how ToM-inspired reasoning and interactive dialogue can enhance cooperative AI in real-world-like tasks.

Abstract

Language agents that interact with the world on their own have great potential for automating digital tasks. While large language model (LLM) agents have made progress in understanding and executing tasks such as textual games and webpage control, many real-world tasks also require collaboration with humans or other LLMs in equal roles, which involves intent understanding, task coordination, and communication. To test LLMs' ability to collaborate, we design a blocks-world environment in which two agents, each having unique goals and skills, build a target structure together. To complete the goals, they can act in the world and communicate in natural language. Under this environment, we design increasingly challenging settings to evaluate different collaboration perspectives, from independent to more complex, dependent tasks. We further adopt chain-of-thought prompts that include intermediate reasoning steps to model the partner's state and identify and correct execution errors. Both human-machine and machine-machine experiments show that LLM agents have strong grounding capacities, and our approach significantly improves the evaluation metric.

Paper Structure (33 sections, 3 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 33 sections, 3 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Task Setting: A human agent (Amy) and an LLM agent (Bob) collaborate on building the block structure with diverse goals and inventories. Sample Task Process: In the task shown, Amy's goal relies on Bob's, so they have to coordinate. To succeed on this task, 1) Amy and Bob communicate their goals and figure out the immediate plan to complete; 2) Amy places the yellow blocks to complete the immediate plan; 3) Amy and Bob coordinate to complete the remaining parts of their goals.
  • Figure 2: Three different collaboration tasks with increasing levels of coordination. Top: independent tasks, which require little coordination between agents; Middle: skill-dependent tasks, in which at least one goal requires both agents to complete; Bottom: goal-dependent tasks, in which one agent's goal depends on prior completion of the partner's goal.
  • Figure 3: The World State consists of the agent's goal, the currently built structure, the dialogue, and the action history. The Prompt Text consists of four steps: 1) analyze the XML world state and summarize the useful information; 2) infer both the agent's and the partner's state; 3) self-reflect to identify errors and adjust the communication strategy; 4) predict the action. We use CoT prompts in all steps.
  • Figure 4: Single-agent experiment settings, in three parts. We represent the blocks by both an XML structure and a textual description. Part I: describe the given XML as a textual description. Part II: convert the XML into a sequence of commands. Part III: directly convert the textual description into a sequence of commands.
  • Figure 5: Experiment results on single-agent experiments (Part I, II, III). LLM agents successfully complete almost all tasks.
  • ...and 4 more figures
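
The captions for Figures 3 and 4 describe an XML world state, a four-step prompt, and a conversion from XML to build commands. The following is a minimal Python sketch of those ideas, not the authors' implementation. The XML schema (tags `world`, `goal`, `built`, `block`, `dialogue`, `turn`), the `place ...` command wording, the choice of `y` as the height axis, and the helper names `remaining_commands` and `build_prompt` are all assumptions made for illustration; the source does not show the actual format or prompt text.

```python
import xml.etree.ElementTree as ET

# Hypothetical world state. The schema is an assumption for illustration only;
# the paper's real XML format is not reproduced in this summary.
WORLD_XML = """
<world>
  <goal agent="Bob">
    <block color="yellow" x="0" y="0" z="0"/>
    <block color="blue" x="0" y="1" z="0"/>
  </goal>
  <built>
    <block color="yellow" x="0" y="0" z="0"/>
  </built>
  <dialogue>
    <turn speaker="Amy">I have yellow blocks. What do you need?</turn>
  </dialogue>
</world>
"""


def blocks(parent):
    """Return (color, x, y, z) tuples for every <block> directly under parent."""
    return [
        (b.get("color"), int(b.get("x")), int(b.get("y")), int(b.get("z")))
        for b in parent.findall("block")
    ]


def remaining_commands(xml_text):
    """Figure 4, Part II style: turn the XML into a sequence of commands.

    Only goal blocks that are not yet built are emitted, lowest first
    (assuming y is the height axis), so supports are placed before
    the blocks that rest on them.
    """
    root = ET.fromstring(xml_text)
    goal = blocks(root.find("goal"))
    built = set(blocks(root.find("built")))
    todo = sorted((b for b in goal if b not in built), key=lambda b: b[2])
    return [f"place {c} at ({x}, {y}, {z})" for c, x, y, z in todo]


# The four steps named in the Figure 3 caption, paraphrased as instructions.
STEPS = [
    "Step 1: Analyze the XML world state and summarize the useful information.",
    "Step 2: Infer both your own state and your partner's state.",
    "Step 3: Reflect: identify any errors and adjust the communication strategy.",
    "Step 4: Predict the next action.",
]


def build_prompt(xml_text):
    """Assemble a four-step prompt with a chain-of-thought cue on every step."""
    parts = ["World state:", xml_text.strip(), ""]
    parts += [f"{step} Think step by step." for step in STEPS]
    return "\n".join(parts)


if __name__ == "__main__":
    print(remaining_commands(WORLD_XML))
    print(build_prompt(WORLD_XML))
```

In this sketch the prompt string would be sent to an LLM of your choice, and the model's Step 4 output would be parsed into a command like those returned by `remaining_commands`. That parsing step is omitted because the source does not specify the action format.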