BRAVE: Broadening the visual encoding of vision-language models

Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

TL;DR

BRAVE tackles the limited visual expressivity of vision-language models by benchmarking diverse vision encoders and introducing MEQ-Former, a lightweight fusion module that consolidates multiple encoders into a compact visual prompt for a frozen LM. This multi-encoder fusion yields state-of-the-art results on captioning and VQA tasks while improving robustness to visual biases and out-of-distribution inputs, all with substantially fewer trainable parameters than prior methods. The work also provides a systematic analysis of how encoder biases and training data shape VLM performance, and demonstrates that expanding visual biases along with efficient fusion can outperform solely scaling the language model. Overall, BRAVE highlights the value of broadening the vision axis and offers a practical, scalable path to more context-aware visual understanding in VLMs.

Abstract

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs.
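
The consolidation step the abstract describes can be illustrated with a minimal sketch. It follows the outline in the Figure 2 caption (per-encoder linear projections, sequence-wise concatenation, resampling by a fixed set of queries through cross-attention), but it is not the authors' implementation: the dimensions, the random weights, and the names `fuse`, `cross_attend`, `linear`, and `rand_matrix` are hypothetical, and the text prompt, the learned attention maps, the 12-layer stack, and the final FC projection are all omitted.

```python
import math
import random

def linear(tokens, w):
    """Project each token with a weight matrix w of shape [d_in][d_out]."""
    return [[sum(t[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for t in tokens]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, tokens):
    """Single-head cross-attention with no learned Q/K/V maps (a simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)])
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def fuse(encoder_outputs, projections, queries):
    """Project every encoder's tokens to a shared width, concatenate them along
    the sequence axis, then resample to len(queries) tokens."""
    tokens = []
    for feats, w in zip(encoder_outputs, projections):
        tokens.extend(linear(feats, w))  # sequence-wise concatenation
    return cross_attend(queries, tokens)

rng = random.Random(0)
shared_dim = 8
# Hypothetical stand-ins for K = 3 frozen encoders: (num_tokens, feature_dim).
encoder_shapes = [(4, 6), (9, 10), (16, 12)]
encoder_outputs = [rand_matrix(rng, n, d) for n, d in encoder_shapes]
projections = [rand_matrix(rng, d, shared_dim) for _, d in encoder_shapes]
queries = rand_matrix(rng, 5, shared_dim)  # learnable parameters in a real model

visual_prompt = fuse(encoder_outputs, projections, queries)
print(len(visual_prompt), len(visual_prompt[0]))  # 5 8
```

The output length depends only on the number of queries, not on how many encoders or tokens are fed in, which is what makes the fused representation compact.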

Paper Structure (29 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: We propose BRAVE to broaden the visual capabilities of vision-language models (VLMs). Left: In contrast to existing methods, e.g. InstructBLIP [dai2023instructblip] or LLaVA-1.5 [liu2023improved], that use a single vision encoder [li2023evaluating, tong2024eyes], BRAVE combines diverse features from multiple vision encoders into a more versatile and compact representation. The examples are taken from [tong2024eyes] and assess the VLM's ability to differentiate images with visual differences. Right: BRAVE leads to state-of-the-art performance on a wide range of captioning and visual question answering tasks. Furthermore, it significantly improves the performance on benchmarks, e.g. MMVP, where commonly employed vision encoders, e.g. CLIP, fail.
  • Figure 2: Overview of BRAVE. Left: We keep all the vision encoders (VEs) and the language model (LM) frozen. The linear projection layers are used to concatenate features from $K$ different VEs, e.g. $K=5$, sequence-wise. These are then resampled by the MEQ-Former, which accepts a set of learnable queries and a text prompt describing the task as inputs. The output of MEQ-Former is projected to the input space of the LM using fully-connected (FC) layers. The total number of trainable parameters is 116M ($\approx 1\%$ of the total parameters). Right: Architecture of the MEQ-Former with $N=12$ transformer layers. It interacts with the concatenated visual features through cross-attention layers and produces a fixed-length output to be fed as a soft visual prompt to the frozen LM.
  • Figure 3: Overview of the evaluation tasks. They evaluate different capabilities of VLMs, which is important to understand their strengths and weaknesses. The visualizations are obtained from the corresponding publications [chen2015microsoft, agrawal2019nocaps, goyal2017making, li2023evaluating, marino2019ok, hudson2019gqa, gurari2018vizwiz, tong2024eyes], respectively.
  • Figure 4: Qualitative results. We compare predictions of BRAVE and the VLMs with different vision encoders, e.g. CLIP, on samples from the MMVP benchmark. Following [tong2024eyes], a model is considered correct only if it answers both images in a pair correctly, i.e. if it can successfully differentiate between images with semantic differences. Note that the images in a pair are seen independently, i.e. neither of the images is provided as context for the other one. All encoders output some correct predictions, yet none of them performs consistently well on a broad range of inputs. BRAVE alleviates this by combining diverse visual features, leading to more consistent performance. The quantitative difference is indeed stark: $42\%$ for BRAVE vs $27.3\%$ for the best single encoder (Tables \ref{tab:benchmark} and \ref{tab:vqa}). See supplementary for more qualitative results.
  • Figure 5: Contribution of vision encoders to BRAVE. Left: We analyze the robustness of BRAVE when a subset of encoders is removed during evaluation. We report the average drop in CIDEr for COCO and accuracy for VQAv2. Right: We compute average attention scores for different vision encoders cross-attended by the MEQ-Former for COCO and VQAv2. See Sec. \ref{sec:ablations} for the discussion.
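
The pair-level scoring rule mentioned in the Figure 4 caption (a pair counts as correct only if both of its images are answered correctly) can be sketched as below. This is an illustration of that rule only, not the benchmark's official evaluation code; the function names and the toy results are hypothetical.

```python
def pair_accuracy(results):
    """results: list of (first_correct, second_correct) booleans, one per image pair.
    A pair scores only when both images were answered correctly."""
    if not results:
        raise ValueError("no pairs to score")
    return sum(1 for first, second in results if first and second) / len(results)

def per_image_accuracy(results):
    """Ordinary accuracy over individual images, shown for contrast."""
    flat = [correct for pair in results for correct in pair]
    return sum(flat) / len(flat)

# Toy outcomes for four pairs (made up for illustration).
toy = [(True, True), (True, False), (False, True), (True, True)]
print(pair_accuracy(toy))       # 0.5
print(per_image_accuracy(toy))  # 0.75
```

Because one wrong answer voids the whole pair, the pair-level score can never exceed per-image accuracy, so it penalizes a model that gives the same answer to both images of a pair.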