VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu
TL;DR
The paper tackles the costly bottleneck of aligning large vision-language models with human preferences by introducing VLFeedback, a large-scale AI-annotated dataset built with GPT-4V that captures multi-aspect preferences (helpfulness, visual faithfulness, ethics) across diverse visual tasks. By applying direct preference optimization (DPO) to a Qwen-VL-Chat baseline, the authors train Silkie, which achieves consistent gains on perception, cognition, and safety benchmarks and shows resilience to red-teaming attacks. Compared to human-annotated baselines, AI-generated preferences yield broader, more robust improvements due to dataset scale and diversity, while enabling scalable data collection via AI annotation. The work also demonstrates the value of including red-teaming content and conducting data-scaling analyses to understand the trade-offs between data volume, cost, and alignment quality, suggesting a practical path for scalable LVLM alignment in real-world deployments.
Abstract
As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.
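
For readers unfamiliar with direct preference optimization, the per-pair objective can be sketched as follows. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function names, the beta value of 0.1, and the log-probabilities in the usage lines are all assumptions made for the example. Each argument is the summed log-probability of a full response (preferred or rejected) under either the policy being trained or the frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Hypothetical log-probabilities, for illustration only.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)    # policy equals reference
loss_improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)   # policy favors the chosen response
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the likelihood of the preferred response relative to the rejected one.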
