MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

Gabriel Roccabruna, Olha Khomyn, Giuseppe Riccardi

TL;DR

MATEO (MultimodAl Temporal Execution Order) is a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs). Six state-of-the-art LVLMs are evaluated on it across model scales, varying language context, multimodal input structure, and fine-tuning strategies.

Abstract

AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.
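
To make the notion of a Temporal Execution Order concrete, the sketch below encodes a small plan as a directed acyclic graph and checks whether a given linear execution respects it. The recipe steps, the dictionary layout, and the helper names `is_acyclic` and `respects_teo` are illustrative assumptions; they are not taken from the MATEO corpus or its annotation format.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical recipe (not from MATEO): each step maps to the set of
# steps that must be completed before it can start.
teo = {
    "boil water": set(),
    "chop vegetables": set(),
    "cook pasta": {"boil water"},
    "saute vegetables": {"chop vegetables"},
    "combine and serve": {"cook pasta", "saute vegetables"},
}


def is_acyclic(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


def respects_teo(order, graph):
    """Return True if `order` runs every step exactly once and never
    runs a step before one of its prerequisites."""
    position = {step: i for i, step in enumerate(order)}
    if len(order) != len(graph) or set(position) != set(graph):
        return False
    return all(
        position[dep] < position[step]
        for step, deps in graph.items()
        for dep in deps
    )


# One valid linearization; a DAG usually admits several.
one_valid_order = list(TopologicalSorter(teo).static_order())
```

Because "boil water" and "chop vegetables" are independent in this toy graph, more than one execution order passes `respects_teo`, which is the kind of information a linear-chain approximation of the TEO cannot represent.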

Paper Structure (16 sections, 1 equation, 8 figures, 4 tables)

This paper contains 16 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Example of an AI agent’s planning process and inherent uncertainties for a natural-language goal. Questions A–D decompose the high-level goal into executable actions, while E infers their Temporal Execution Order (TEO) as a directed acyclic graph. Each step introduces uncertainty, producing multiple possible paths, some correct, others wrong.
  • Figure 2: This figure shows the first two points of the annotation guidelines, which present the purpose and the general description of the task.
  • Figure 3: This section of the annotation guidelines offers annotators the option to watch a demonstration video that walks through the task and explains how to identify dependencies between steps. Following the video, annotators are provided with a textual description that formally defines each dependency, accompanied by illustrative examples.
  • Figure 4: This figure continues from the previous one, illustrating a case in which a step depends on another (Example 1.1), and presenting the formal definition of independent steps.
  • Figure 5: This figure shows an example given to illustrate the case in which steps are independent.
  • ...and 3 more figures