OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation

Paweł Budzianowski, Emilia Wiśnios, Gracjan Góral, Igor Kulakov, Viktor Petrenko, Krzysztof Walas

TL;DR

OpenGVL introduces an open benchmark for temporal progress prediction in robotics, built on the GVL framework and evaluated across diverse tasks. It demonstrates a substantial performance gap between open-source and proprietary VLMs, a gap that persists even as model scale increases, and shows practical utility for automated data curation and quality assessment on large robotics datasets. The work provides an evaluation space (OpenGVL Space), extensive experimental setups, and analyses of data issues (unclear task definitions, labeling ambiguity, and out-of-distribution (OOD) examples) to support robust, data-driven curation of robotics data. Overall, OpenGVL offers a scalable, open pathway to improve temporal reasoning in robotics and to curate in-the-wild data for large-scale robotic learning.

Abstract

Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately 70% of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase as OpenGVL (github.com/budzianowski/opengvl).

Paper Structure

This paper contains 14 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Cumulative number of shared datasets for the LeRobot tag on the HF Datasets Hub.
  • Figure 2: Trajectory prediction performance comparison on the hidden human task. Models were tasked with predicting task completion percentages from shuffled trajectory inputs. The predicted scores were then sorted by ground-truth values for visualization. Top: Gemini-2.5-Pro shows signs of a monotonic upward trend. Bottom: Gemma-3-27B-it shows minimal predictive alignment, indicating difficulty in discerning task completion patterns from visual trajectory data.
  • Figure 3: On hidden tasks 1 and 2, zero-shot VOC scores cluster at or below chance level, indicating poor cold-start grounding. Two-shot prompting generally improves VOC scores, but many remain weak (approximately 0.1 to 0.3); only a minority reach moderate performance (≥ 0.4) and very few reach strong performance (≥ 0.7). These tasks therefore remain challenging overall: few-shot prompting provides some benefit but is often insufficient on its own to achieve robust performance.
  • Figure 4: OpenGVL Benchmark Space and interactive analysis of different models and datasets.
  • Figure 5: Examples of different datasets published by the community and analyzed in the data curation section.
  • ...and 5 more figures
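
The captions of Figures 2 and 3 describe scoring a model by how well its predicted completion percentages for shuffled frames agree with the true temporal order. The sketch below illustrates one way such a score could be computed. It is not the benchmark's implementation: this section does not define VOC, so the code assumes it is a Spearman-style rank correlation between per-frame predictions and ground-truth frame positions. The function names and the choice to return 0.0 for constant predictions are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_correlation(predicted: Sequence[float], true_position: Sequence[float]) -> float:
    """Pearson correlation of ranks (Spearman) between predictions and true order.

    Assumption: returns 0.0 when either input is constant, since the
    correlation is undefined in that case.
    """
    if len(predicted) != len(true_position):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(predicted), average_ranks(true_position)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Frames are shown to the model in shuffled order; true_position[i] is the
# original index of the i-th frame shown, predicted[i] is the model's
# completion percentage for that frame (illustrative numbers, not real results).
true_position = [3, 0, 4, 1, 2]
well_ordered = [70, 5, 95, 30, 50]   # increases with true position
reversed_pred = [30, 95, 5, 70, 50]  # decreases with true position
constant_pred = [50, 50, 50, 50, 50]  # carries no ordering information

score_good = rank_correlation(well_ordered, true_position)
score_bad = rank_correlation(reversed_pred, true_position)
score_flat = rank_correlation(constant_pred, true_position)
```

Under this assumed definition, a score near 1 means the predictions recover the temporal order, a score near 0 corresponds to chance, and a negative score means the predicted order is inverted. The thresholds quoted in Figure 3 (0.4 and 0.7) would be read on this same scale.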