
Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search

Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, Liang Lin

TL;DR

This work formalizes Human-Centric Open-Future Task Discovery (HOTD) to identify tasks that reduce human effort across uncertain futures, introducing HOTD-Bench for real-world video-based evaluation and CMAST as a scalable, multi-agent search-tree framework. HOTD-Bench combines a simulation-based protocol and open-vocabulary labels to assess potential tasks beyond observed trajectories. CMAST decomposes complex reasoning across specialized agents and a structured search tree, enabling robust task discovery and integration with diverse LMMs, achieving superior Valid Task Ratio and competitive Valid Task Count. The combined approach advances anticipatory, human-aligned assistance in dynamic, open-ended environments and provides a scalable evaluation platform for future embodied AI systems.

Abstract

Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving their performance.

Paper Structure

This paper contains 19 sections, 8 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: An illustration of HOTD. Driven by an overall goal, humans often engage in concurrent sub-processes, resulting in multiple possible future branches. HOTD aims to identify tasks that remain helpful across diverse and uncertain futures. For example, as highlighted by the green box, completing the task "wipe the table" in advance saves human effort regardless of the order of the other steps.
  • Figure 2: The process of simulation-based evaluation (upper) & annotation pipeline (lower). The simulator first takes a discovered task $\hat{y}_n$, a historical action sequence and its associated goal $z$ as input. Then, it simulates the resulting future trajectory $A^{\prime}_z$ by accounting for the adjusted human actions until the goal. Finally, it summarizes the overall process to estimate the incurred cost $\mathcal{L}(A^{\prime}_z, z)$ and compares it to the original cost $\mathcal{L}(A_z, z)$. In the annotation pipeline, future actions are first selected to meet the helpful principle, then expanded into descriptive sentences and filtered through the non-disruptive principle and the executable principle, finally forming the task labels.
  • Figure 3: The overview of the Collaborative Multi-Agent Search Tree framework. It structures the HOTD reasoning with 7 LMM agents and a scalable search tree module.
  • Figure 4: Human evaluation of the simulator. Columns indicate how many annotators rated each task as helpful, where 0 means all rated it unhelpful and 5 means all rated it helpful. The distribution shows strong agreement between the simulator and human preferences.
  • Figure 5: The ablation studies on the search tree module and the component agents. We show the results given 40 sec of observations in the TSU, evaluated by simulator (left) and labels (right).
  • ...and 4 more figures
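
The cost comparison described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's simulator: the names `action_count_cost` and `is_helpful` are hypothetical, and the cost $\mathcal{L}$ is assumed here to be the number of remaining human actions, which the caption does not specify. A discovered task is treated as helpful when the simulated trajectory $A^{\prime}_z$ costs less than the original trajectory $A_z$.

```python
from typing import Callable, Sequence

Trajectory = Sequence[str]  # a sequence of human actions leading to the goal


def action_count_cost(trajectory: Trajectory, goal: str) -> float:
    """Hypothetical stand-in for L(A, z): one unit of effort per human action.

    The goal is accepted only to mirror the L(A, z) signature; this toy
    cost ignores it.
    """
    return float(len(trajectory))


def is_helpful(
    original: Trajectory,
    simulated: Trajectory,
    goal: str,
    cost: Callable[[Trajectory, str], float] = action_count_cost,
) -> bool:
    """Return True if the simulated future A'_z is cheaper than the original A_z."""
    return cost(simulated, goal) < cost(original, goal)


# Toy example based on Figure 1: the assistant wipes the table in advance,
# so the human no longer has to perform that step.
goal = "prepare dinner"
original_future = ["wipe the table", "set the plates", "serve the food"]
simulated_future = ["set the plates", "serve the food"]

helpful = is_helpful(original_future, simulated_future, goal)
unchanged = is_helpful(original_future, original_future, goal)
```

In this toy setting `helpful` is true because the human performs one fewer action, while `unchanged` is false because a task that leaves the future trajectory untouched saves no effort. A real evaluation would replace both the trajectory simulation and the cost with the LMM-based simulator the caption describes.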