VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu

TL;DR

The paper tackles the costly bottleneck of aligning large vision-language models with human preferences by introducing VLFeedback, a large-scale AI-annotated dataset built with GPT-4V that captures multi-aspect preferences (helpfulness, visual faithfulness, ethics) across diverse visual tasks. By applying direct preference optimization (DPO) to a Qwen-VL-Chat baseline, the authors train Silkie, which achieves consistent gains on perception, cognition, and safety benchmarks and shows resilience to red-teaming attacks. Compared to human-annotated baselines, AI-generated preferences yield broader, more robust improvements due to dataset scale and diversity, while enabling scalable data collection via AI annotation. The work also demonstrates the value of including red-teaming content and conducting data-scaling analyses to understand the trade-offs between data volume, cost, and alignment quality, suggesting a practical path for scalable LVLM alignment in real-world deployments.

Abstract

As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.
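
For readers unfamiliar with direct preference optimization, the per-pair objective can be sketched as follows. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are taken to be summed log-probabilities of a full response under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence-level log-probabilities. `beta` (assumed
    value, not taken from the paper) scales how strongly the policy is
    pushed away from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, falls below that when the policy favors the chosen response more than the reference does, and rises above it in the opposite case.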

Paper Structure

This paper contains 45 sections, 3 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: VLFeedback dataset construction framework. We collect instructions from various sources and decode the corresponding responses using models randomly sampled from the pool. GPT-4V then assesses these responses on three aspects, providing ratings and rationales for its scores.
  • Figure 2: Rating distribution of different aspects. Helpfulness and Visual Faithfulness share similar score distributions. The red-teaming subset contains a large portion of samples that are perceived to be unsafe.
  • Figure 3: Relative performance gain comparison between the RLHF-V dataset and our VLFeedback.
  • Figure 4: Impact of varying VLFeedback ratios on model performance. Performance plateaus with insufficient preference pairs (ratio < 0.2) but improves significantly without diminishing returns at higher ratios.
  • Figure 5: Case studies on evaluation samples from MMHal-Bench (left), MM-Vet (middle) and RTVLM (right). Our Silkie locates the wooden stools with a red flower without giving misleading assertions, and correctly answers the science-related question. After RT DPO, Silkie$_\text{RT}$ refuses to answer a malicious jailbreaking query.
  • ...and 5 more figures
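
The Figure 1 caption describes per-aspect ratings for several responses to the same instruction. One plausible way to turn such ratings into preference pairs for DPO is sketched below. The pairing rule (average the three aspect scores, prefer the higher average, skip ties), the record layout, and the field names are all assumptions made for illustration; this section does not state how the authors build their pairs.

```python
from itertools import combinations

# Hypothetical field names for the three rated aspects.
ASPECTS = ("helpfulness", "visual_faithfulness", "ethics")


def build_preference_pairs(responses):
    """Build (chosen, rejected) pairs from rated responses to one instruction.

    Each item in `responses` is assumed to look like
    {"text": str, "ratings": {aspect: numeric score}}.
    """
    # Collapse the per-aspect ratings into one average score per response.
    scored = [
        (sum(r["ratings"][a] for a in ASPECTS) / len(ASPECTS), r["text"])
        for r in responses
    ]

    pairs = []
    for (score_a, text_a), (score_b, text_b) in combinations(scored, 2):
        if score_a == score_b:
            continue  # a tie carries no preference signal
        if score_a > score_b:
            pairs.append({"chosen": text_a, "rejected": text_b})
        else:
            pairs.append({"chosen": text_b, "rejected": text_a})
    return pairs
```

Under this rule, n rated responses yield at most n(n-1)/2 pairs per instruction, fewer when average scores tie.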