
Large Language Models as Automated Aligners for benchmarking Vision-Language Models

Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

TL;DR

Auto-Bench introduces a scalable, automated pipeline to benchmark Vision-Language Models against human capacities and values by using LLMs as data curators to auto-generate vast, human-aligned QA data from rich visual symbolizations and LLMs as judges to evaluate VLM outputs. The pipeline yields 28.5K human-verified and 3.504M raw triplets across four abilities and 16 sub-skills, enabling comprehensive, capacity-oriented evaluation and the possibility of supervised finetuning. Empirical results show high agreement between LLM judges and human judgments, reveal strengths and weaknesses across eight prevalent VLMs, and demonstrate substantial gains from supervised finetuning on the generated data. Auto-Bench thus provides a flexible, scalable resource for evaluating and guiding the development of VLMs, with implications for governance, safety, and alignment with human values.

Abstract

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, which rely primarily on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs such as GPT-3.5 to serve as judges, implementing quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating increasingly sophisticated VLMs.
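The two roles the abstract assigns to LLMs can be illustrated with a small sketch. It is not the authors' implementation: the `ImageSymbols` container, the `build_curation_prompt` wording, and the definition of `agreement_rate` as the plain fraction of matching verdicts are all assumptions made for illustration. The first function shows how symbolic annotations of an image might be flattened into a text prompt for a curator LLM; the second shows one simple way an agreement rate between judge and human verdicts could be computed.

```python
from dataclasses import dataclass, field


@dataclass
class ImageSymbols:
    """Hypothetical container for the symbolic description of one image."""
    captions: list[str]
    # Each object is (label, (x, y, width, height)); the box format is assumed.
    objects: list[tuple[str, tuple[float, float, float, float]]]
    # Each relation is (subject, predicate, object).
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def build_curation_prompt(symbols: ImageSymbols, skill: str) -> str:
    """Flatten symbolic annotations into a text prompt for a curator LLM.

    The prompt wording is illustrative only, not the paper's actual prompt.
    """
    lines = [f"Target skill: {skill}", "Captions:"]
    lines += [f"- {caption}" for caption in symbols.captions]
    lines.append("Objects (label, box):")
    lines += [f"- {label} at {box}" for label, box in symbols.objects]
    if symbols.relations:
        lines.append("Relations:")
        lines += [f"- {s} {p} {o}" for s, p, o in symbols.relations]
    lines.append(
        "Write one question about the image, its answer, and the reasoning "
        "that leads to the answer."
    )
    return "\n".join(lines)


def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the LLM judge and the human give the same verdict."""
    if not judge or len(judge) != len(human):
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


example = ImageSymbols(
    captions=["A dog chases a ball on the grass."],
    objects=[("dog", (10.0, 20.0, 80.0, 60.0)), ("ball", (120.0, 70.0, 15.0, 15.0))],
    relations=[("dog", "chases", "ball")],
)
prompt = build_curation_prompt(example, skill="reasoning")
rate = agreement_rate([True, False, True, True], [True, True, True, True])
```

In this toy run the judge and the human agree on three of four items, so `rate` is 0.75. A real pipeline would send `prompt` to an LLM and parse the returned triplet, a step omitted here because the source does not specify it.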

Paper Structure (65 sections, 7 figures, 8 tables)


Figures (7)

  • Figure 1: Overview of the Auto-Bench pipeline for benchmarking VLMs' alignment with humans. First, we symbolize the images via various structured annotations coupled with specific curation requirements. Then we prompt the LLM to generate question, answer, and chain-of-thought reasoning triplets for both quantitative and qualitative evaluation.
  • Figure 2: Data samples of Auto-Bench, which covers four evaluation dimensions: perception, reasoning, planning, and value alignment. Each dimension contains several sub-skills. For additional examples, please refer to the appendix.
  • Figure 3: Comparative analysis of question length and diversity between Auto-Bench and multiple public datasets. For the analysis of answers, please refer to the appendix.
  • Figure 4: Performance comparisons of various VLMs across different sub-skills via radar charts.
  • Figure 5: Box plot of judgment correctness on open-set questions across various capacities.
  • ...and 2 more figures