TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Ken Fukuda, Teruko Mitamura

TL;DR

Procedural activity understanding requires aligning observed actions with described procedures in dynamic video settings. The authors introduce TAMA, a training-free Tool-Augmented Multimodal Agent that uses multimedia-returning tools to enable interleaved multimodal reasoning. On ProMQA-Assembly, TAMA yields model-dependent gains, notably improving GPT-5 and MiMo-VL, with ablations confirming the value of multimedia outputs and flexible tool use. This work advances thinking-with-images for video understanding and supports the development of capable procedural activity assistants.

Abstract

Procedural activity assistants could support humans in a variety of settings, from daily life, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored to such assistants remains underexplored. In this paper, we propose TAMA, a Tool-Augmented Multimodal Agent, a novel framework for procedural activity understanding. TAMA enables interleaved multimodal reasoning by using multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset ProMQA-Assembly show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework: multimedia-returning tools and flexible, agentic tool selection. We believe our proposed framework and experimental results will facilitate the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.

Paper Structure

This paper contains 30 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Left: Overview of existing approach. Right: Overview of our proposed approach, TAMA. Given a question as an initial input, a VLM-based agent generates its thought, followed by a tool call. Once a tool output is produced, the concatenation of the model output and the tool output is appended to the previous input to form the next input. Then, the model further generates either the next pair of a thought and tool call or an answer.
  • Figure 2: Tool usage pattern.
  • Figure 3: Performance of the workflow approach vs. the agentic approach (TAMA). Each digit denotes one tool operation in the workflow approach: "1" is uniform sampling, "2" is the instruction check, and "3" is the target assembly image check. Each digit sequence defines the execution order of the tools.
  • Figure 4: Prompt for frame selection in TCoT.
  • Figure 5: Prompt for answer generation in TCoT.
  • ...and 8 more figures
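
The loop described in the Figure 1 caption (thought and tool call, tool output appended to the input, repeat until an answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `run_agent`, `scripted_model`, `ToolCall`, `Answer`, and the tool names are hypothetical, the model is a scripted stub rather than a real VLM, and tool outputs are placeholder strings standing in for the images or frames a multimedia-returning tool would produce.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union


@dataclass
class ToolCall:
    """Model output that pairs a thought with a tool to call (hypothetical type)."""
    thought: str
    tool: str


@dataclass
class Answer:
    """Model output that ends the loop (hypothetical type)."""
    text: str


Step = Union[ToolCall, Answer]


def run_agent(
    question: str,
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Interleave model outputs and tool outputs until the model answers."""
    context: List[str] = [question]  # the initial input is the question
    for _ in range(max_steps):
        step = model(context)
        if isinstance(step, Answer):
            return step.text, context
        tool_output = tools[step.tool]()
        # Append the model output and the tool output to form the next input.
        context.append(f"thought: {step.thought} | call: {step.tool}")
        context.append(tool_output)
    return None, context  # no answer within the step budget


# Stub tools: a real tool would return frames or images, not strings.
STUB_TOOLS: Dict[str, Callable[[], str]] = {
    "sample_frames": lambda: "<uniformly sampled frames>",
    "check_instruction": lambda: "<instruction image>",
}


def scripted_model(context: List[str]) -> Step:
    """Stand-in for a VLM: call two tools in a fixed order, then answer."""
    plan = ["sample_frames", "check_instruction"]
    calls_so_far = (len(context) - 1) // 2  # each call adds two context entries
    if calls_so_far < len(plan):
        name = plan[calls_so_far]
        return ToolCall(thought=f"I need {name}", tool=name)
    return Answer("placeholder answer")


answer, trace = run_agent("Which step was skipped?", scripted_model, STUB_TOOLS)
```

In this sketch the order of tool calls is fixed by the stub; in an agentic setting the model itself would choose which tool to call next, or whether to answer, based on the accumulated context.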