Feedback-Driven Vision-Language Alignment with Minimal Human Supervision

Giorgio Giannone; Ruoteng Li; Qianli Feng; Evgeny Perevodchikov; Rui Chen; Aleix Martinez

TL;DR

This work tackles the data- and annotation-hungry nature of vision-language models by introducing Sampling-based Visual Projection (SVP), a feedback-driven framework that uses a small seed set, self-captioning, and an external grounding model to improve cross-modal alignment without paired image-text data. SVP operates through an inner-loop sampling process that generates latent Visual Projections conditioned on grounding feedback, a scoring mechanism that quantifies alignment gains, and an outer-loop adaptation (LoRA) that updates the base model using the most informative samples. Across ten vision-language benchmarks spanning captioning, referring expressions, VQA, multitasking, and hallucination control, SVP delivers substantial improvements, including strong captioning gains (~20%), enhanced referring capabilities, reduced hallucinations, and improved object recall, often matching or approaching the performance of larger models with far fewer parameters. The method demonstrates that targeted grounding feedback can elicit latent capabilities in VLMs, enabling more reliable, contextually grounded outputs and broad applicability across architectures with limited annotated data. These results highlight the potential for data-efficient, feedback-driven alignment to scale VLM capabilities for real-world deployment where extensive labeled data is impractical.
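The loop described above (sample candidates, score them with grounding feedback, keep the most informative ones for adaptation) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names, the representation of a caption as a tuple of object words, the score (the fraction of mentioned objects a grounder confirms), and the top-k selection rule are hypothetical and are not taken from the paper. For simplicity the sketch uses grounding feedback only for scoring, whereas SVP also conditions sampling on it, and the LoRA update itself is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical toy stand-ins throughout; this is not the paper's implementation.


@dataclass(frozen=True)
class Candidate:
    image_id: str
    caption_objects: tuple  # object words mentioned by one sampled caption
    score: float            # assumed score: share of mentioned objects that were grounded


def sample_caption(vocabulary, rng, max_objects=4):
    """Stand-in for VLM self-captioning: draw a random set of object words."""
    k = rng.randint(1, max_objects)
    return tuple(rng.sample(vocabulary, k))


def ground(caption_objects, objects_in_image):
    """Stand-in for the grounding model: keep the objects it can localize."""
    return [obj for obj in caption_objects if obj in objects_in_image]


def score(caption_objects, grounded):
    """Assumed alignment score in [0, 1]; ungrounded mentions lower it."""
    return len(grounded) / len(caption_objects)


def inner_loop(image_id, objects_in_image, vocabulary, rng, num_samples=8):
    """Sample several candidate captions for one seed image and score each."""
    candidates = []
    for _ in range(num_samples):
        objs = sample_caption(vocabulary, rng)
        grounded = ground(objs, objects_in_image)
        candidates.append(Candidate(image_id, objs, score(objs, grounded)))
    return candidates


def select_for_adaptation(candidates, top_k=2):
    """Keep the highest-scoring samples; these would feed the adaptation step."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]


def build_adaptation_set(seed_images, vocabulary, seed=0, num_samples=8, top_k=2):
    """Outer loop over seed images; returns the data an adapter would train on."""
    rng = random.Random(seed)
    selected = []
    for image_id, objects_in_image in seed_images.items():
        candidates = inner_loop(image_id, objects_in_image, vocabulary, rng, num_samples)
        selected.extend(select_for_adaptation(candidates, top_k))
    return selected
```

Calling `build_adaptation_set({"img0": {"dog", "ball"}}, ["dog", "ball", "car", "tree", "cat", "boat"])` returns the two best-scoring candidates for that image. In a real pipeline the random sampler would be replaced by the VLM, the set lookup by the grounding model, and the returned samples would be used to fit a LoRA adapter.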

Abstract

Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Sampling-based Visual Projection), a novel framework that enhances vision-language alignment without relying on manually curated image-text pairs or preference annotation. SVP leverages a small set of manually selected images, self-captioning, and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to a 12% increase in object recall, and significantly reduced hallucinations, while maintaining question-answering capabilities. Using SVP, a small VLM achieves hallucination reductions similar to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.

Paper Structure (52 sections, 23 equations, 35 figures, 11 tables, 2 algorithms)

Introduction
Contributions
Background
Notation
Vision-Language Models
Vision-Language Grounding
Method
Problem Formulation
Sampling
Scoring
Adaptation
Inner/Outer-loop Interpretation
Experiments
Base Model Selection
Seed Images and Models
...and 37 more sections

Figures (35)

Figure 1: Improving Vision-Language Alignment. Vision-language models (VLMs) often produce descriptions lacking specificity and accuracy, frequently hallucinating objects or missing important elements (left). Our Sampling-based Visual Projection (SVP) addresses these issues by leveraging self-captioning and grounding feedback. SVP enhances visual-language alignment without requiring human annotations, curated image-text pairs, or expensive AI feedback (right). This leads to models with greater contextual relevance, fewer hallucinations, and enhanced object recall. See the appendix for details.
Figure 2: Referring w/ Bounding Box (left) and Segmentation Mask (right).
Figure 3: Captioning w/ 7b (left) and 13b (right) models.
Figure 4: Object Recall and Hallucination Reduction.
Figure 6: Vision-Language Generative Model (left) and Vision-Language Grounding (right)
...and 30 more figures
