ViSTa Dataset: Do vision-language models understand sequential tasks?

Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner

TL;DR

This work uses ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks, to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. It finds that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.

Abstract

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.

Paper Structure

This paper contains 32 sections, 1 equation, 17 figures, 8 tables.

Figures (17)

  • Figure 1: ViSTa is a hierarchical dataset of videos with step-by-step descriptions. ViSTa enables granular testing of task sequences in three different environments. Tasks are organized by number of sub-tasks into a hierarchy of 8 levels. Videos within levels are grouped into problem sets (Figure 2) testing specific capabilities.
  • Figure 2: A problem set for action-order understanding. Problem sets are groups of videos to be matched with their descriptions. Each set targets a specific capability, e.g. understanding action order.
  • Figure 3: Macro F1 score averaged over groups of problem sets in level 1 and in higher levels. Error ranges are 95% C.I.
  • Figure 4: Understanding action order in long videos is hard. In permutation problems, which focus on testing action-order understanding, GPT-4o's performance starts dropping after level 4, ending up at around 50% of its original value for videos with 8 actions. This is weak performance, considering that the majority-class predictor baseline is already quite high. The other two models are barely above the baseline. Notably, we do not observe this behavior in the general problem sets. Despite the baseline being lower there, the models all have higher performance and, except for ViCLIP, retain it from level 2 all the way through level 8. Error ranges are 95% C.I.
  • Figure 5: Frame rate and model scale play an important role in general sequential tasks. We see that the performance of CLIP rises with increasing frame rate. When we compare CLIP-8 and ViCLIP, which both get 8 frames, we see that CLIP-8 nevertheless does much better. This is likely due to its larger scale, since other differences between the models are minimal. Ranges are 95% C.I.
  • ...and 12 more figures
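
The captions for Figures 3 and 4 report macro F1 and compare it against a majority-class predictor baseline. The following is a minimal sketch of how such a comparison could be computed. It is not the authors' evaluation code: the function names `macro_f1` and `majority_baseline`, the toy labels, and the choice to average over every label that appears in either the ground truth or the predictions are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Assumption: the class set is every label seen in y_true or y_pred.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def majority_baseline(y_true):
    """Predict the most frequent ground-truth label for every example."""
    majority_label = Counter(y_true).most_common(1)[0][0]
    return [majority_label] * len(y_true)


# Hypothetical toy labels: "a" and "b" stand in for candidate descriptions.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

model_score = macro_f1(y_true, y_pred)
baseline_score = macro_f1(y_true, majority_baseline(y_true))
print(f"model macro F1:    {model_score:.3f}")
print(f"baseline macro F1: {baseline_score:.3f}")
```

Because macro F1 weights every class equally, a majority-class predictor scores zero on all minority classes, so the height of this baseline depends on how the classes are distributed within each problem set.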