Feedback-Driven Vision-Language Alignment with Minimal Human Supervision
Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, Aleix Martinez
TL;DR
This work tackles the data- and annotation-hungry nature of vision-language models by introducing Sampling-based Visual Projection (SVP), a feedback-driven framework that uses a small seed set, self-captioning, and an external grounding model to improve cross-modal alignment without paired image-text data. SVP operates through an inner-loop sampling process that generates latent Visual Projections conditioned on grounding feedback, a scoring mechanism that quantifies alignment gains, and an outer-loop adaptation (LoRA) that updates the base model using the most informative samples. Across ten vision-language benchmarks spanning captioning, referring expressions, VQA, multitasking, and hallucination control, SVP delivers substantial improvements, including strong captioning gains (~20%), enhanced referring capabilities, reduced hallucinations, and improved object recall, often matching or approaching the performance of larger models with far fewer parameters. The method demonstrates that targeted grounding feedback can elicit latent capabilities in VLMs, enabling more reliable, contextually grounded outputs and broad applicability across architectures with limited annotated data. These results highlight the potential of data-efficient, feedback-driven alignment to scale VLM capabilities for real-world deployment where extensive labeled data is impractical.
Abstract
Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Sampling-based Visual Projection), a novel framework that enhances vision-language alignment without relying on manually curated text-image pairs or preference annotation. SVP leverages a small set of manually selected images, self-captioning, and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to a 12% increase in object recall, and significantly reduced hallucinations, while maintaining question-answering capabilities. Using SVP, a small VLM achieves hallucination reductions similar to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
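As a rough illustration of the sample, score, and select pattern summarized above, the following minimal Python sketch draws several candidate captions per image, scores each against a stand-in for grounding feedback, and keeps the highest-scoring fraction. Every name here (`select_informative_samples`, `make_toy_sampler`, `toy_score`, `TOY_OBJECTS`, `TOY_CAPTIONS`) is hypothetical and chosen for illustration; the toy sampler and toy score are placeholders for the VLM and the grounding model, not the paper's actual components, and the LoRA adaptation step is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    image_id: str
    caption: str
    score: float


def select_informative_samples(image_ids, sample_caption, score_fn,
                               num_samples=4, keep_ratio=0.5):
    """Sample candidates per image, score them, and keep the top fraction.

    `sample_caption` and `score_fn` are caller-supplied callables; in this
    sketch they stand in for the VLM sampler and the grounding-based score.
    """
    scored = []
    for image_id in image_ids:
        for _ in range(num_samples):
            caption = sample_caption(image_id)
            scored.append(Sample(image_id, caption, score_fn(image_id, caption)))
    # Rank all candidates by score, highest first.
    scored.sort(key=lambda s: s.score, reverse=True)
    keep = max(1, int(len(scored) * keep_ratio))
    # The kept samples would feed a separate adaptation step (not shown).
    return scored[:keep]


# Toy stand-ins (assumptions for illustration only).
TOY_OBJECTS = {"img_0": {"dog", "ball"}, "img_1": {"cat"}}
TOY_CAPTIONS = ["a dog chasing a ball", "a cat on a sofa", "a dog", "an empty street"]


def make_toy_sampler(seed=0):
    """Return a sampler that picks a random caption, ignoring the image."""
    rng = random.Random(seed)
    return lambda image_id: rng.choice(TOY_CAPTIONS)


def toy_score(image_id, caption):
    """Fraction of the image's known objects that the caption mentions."""
    objects = TOY_OBJECTS[image_id]
    return len(set(caption.split()) & objects) / len(objects)


if __name__ == "__main__":
    selected = select_informative_samples(
        ["img_0", "img_1"], make_toy_sampler(0), toy_score)
    for s in selected:
        print(s.image_id, repr(s.caption), round(s.score, 2))
```

Taking the sampler and the scorer as plain callables keeps the selection logic independent of any particular model, so either stand-in could be swapped for a real component without changing the loop.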
