Gamified crowd-sourcing of high-quality data for visual fine-tuning

Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan

TL;DR

This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models and implements a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks.

Abstract

This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model's knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B, increasing its GPT score from 0.147 to 0.477 on our dataset, approaching the benchmark set by the much larger GPT-4V. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B also enhances its performance on other benchmarks and exhibits cross-model benefits. Specifically, the same data improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on those same benchmarks.

Paper Structure (31 sections, 12 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: For each image, the player's goal is to find a question that the AI answers incorrectly: (a) the player asks a question and the model answers correctly; (b) the player tries again, and the model again answers correctly; (c) finally, the model answers the third question incorrectly. Each session has 10 images, 5 tainted and 5 untainted; players do not know which images are tainted. The player moves to the next image after 120 seconds or after choosing "Wrong Answer." At the end of the session, the player earns 20 points for each tainted image on which they chose "Wrong Answer" and the model had been instructed to answer incorrectly.
  • Figure 2: Weekly player growth on our Telegram miniapp.
  • Figure 3: (a) Sample image from the tainted dataset: a few easily identifiable items with basic lighting; (b) sample image from the untainted dataset: multiple people and items with complex lighting.
  • Figure 5: Bimodal distribution of $P(M = 0 \mid H = 0, D = \text{Tainted})$, evaluated on the tainted dataset: participants tend to cluster at either high or low accuracy levels, with fewer achieving mid-range performance.
  • ...and 2 more figures
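
The session rule in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative reading of the caption, not the authors' implementation: the `ImageRound` record and both function names are hypothetical, and the second function assumes that, in the Figure 5 quantity, $M = 0$ means the model answered incorrectly and $H = 0$ means the player chose "Wrong Answer", which this page does not state.

```python
from dataclasses import dataclass
from typing import List, Optional

POINTS_PER_CATCH = 20  # points per qualifying tainted image (Figure 1 caption)


@dataclass
class ImageRound:
    """One image in a session (hypothetical record layout)."""
    tainted: bool           # hidden from the player
    instructed_wrong: bool  # model was instructed to answer incorrectly
    flagged_wrong: bool     # player chose "Wrong Answer" before the 120 s limit


def session_score(rounds: List[ImageRound]) -> int:
    """20 points per tainted image that the player flagged and on which
    the model had been instructed to answer incorrectly."""
    catches = sum(
        1 for r in rounds
        if r.tainted and r.instructed_wrong and r.flagged_wrong
    )
    return POINTS_PER_CATCH * catches


def flag_precision_on_tainted(rounds: List[ImageRound]) -> Optional[float]:
    """Fraction of the player's "Wrong Answer" flags on tainted images where
    the model really was instructed to answer incorrectly. ASSUMPTION: this
    is one empirical reading of P(M = 0 | H = 0, D = Tainted)."""
    flagged = [r for r in rounds if r.tainted and r.flagged_wrong]
    if not flagged:
        return None  # the player never flagged a tainted image
    return sum(r.instructed_wrong for r in flagged) / len(flagged)


# A made-up 10-image session: 5 tainted, 5 untainted.
session = [
    ImageRound(tainted=True, instructed_wrong=True, flagged_wrong=True),
    ImageRound(tainted=True, instructed_wrong=True, flagged_wrong=True),
    ImageRound(tainted=True, instructed_wrong=True, flagged_wrong=False),
    ImageRound(tainted=True, instructed_wrong=False, flagged_wrong=True),
    ImageRound(tainted=True, instructed_wrong=False, flagged_wrong=False),
    ImageRound(tainted=False, instructed_wrong=False, flagged_wrong=True),
    ImageRound(tainted=False, instructed_wrong=False, flagged_wrong=True),
    ImageRound(tainted=False, instructed_wrong=False, flagged_wrong=True),
    ImageRound(tainted=False, instructed_wrong=False, flagged_wrong=False),
    ImageRound(tainted=False, instructed_wrong=False, flagged_wrong=False),
]

print(session_score(session))              # 40
print(flag_precision_on_tainted(session))  # 2 of 3 tainted flags were real errors
```

In this sketch, flags on untainted images earn no points, because the caption ties the score to tainted images only. How submissions on untainted images are evaluated is not described on this page and is left out here.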