QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining

Kyle R. Chickering, Bangzheng Li, Muhao Chen

TL;DR

This work tackles the difficulty of fine-grained VQA in multimodal LLMs that use CLIP-style vision encoders by identifying two underlying biases: mesoscopic bias from uniform patch grids and interpolation bias from fixed positional embeddings. It introduces QLIP, a lightweight, drop-in replacement for CLIP that combines a content-aware quadtree patchification (QtP) with a coordinate-based MLP that interpolates positional signals, so the underlying MLLM needs no retraining. The approach yields substantial gains, notably +$13.6\%$ on the $V^*$ benchmark with LLaVA-13B and a $5.2$ point reduction in POPE F1, while maintaining performance across other benchmarks and reducing the token budget. This makes high-resolution VQA practical in existing MLLMs without expensive retraining or fine-tuning, extending them to more detailed visual reasoning tasks.

Abstract

Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. The CLIP model is a widely adopted foundational vision-language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations, including being restricted to fixed input resolutions and failing to produce well-separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational cost because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be integrated with existing MLLMs with only a few lines of code and can enhance both coarse-grained and fine-grained visual understanding without retraining. QLIP is designed around an image quadtree which replaces the standard uniform grid patches with a novel content-aware patchification. Our experimental results demonstrate that QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes, without requiring retraining or fine-tuning of the full MLLM. Notably, QLIP boosts detailed understanding performance on the challenging $V^{\ast}$ benchmark by up to 13.6 percent.

Paper Structure

This paper contains 31 sections, 6 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: QLIP is a drop-in replacement for CLIP which allows models like LLaVA to perform inference on arbitrarily large images. In our experiments we find that vanilla LLaVA + QLIP gives +13.6% accuracy on the challenging $V^*$ benchmark with no retraining or fine-tuning. The example in the figure above demonstrates an instance where CLIP cannot answer correctly because (a) in the cropped version of the image the person in question is not present, and (b) if we use a padded image the person will be too small to provide a meaningful signal to the model.
  • Figure 2: An example of the same semantic feature (animal:elephant) at three different spatial scales. These photos could be accompanied by the question "What animal is shown in this photo?" For the leftmost image the elephant fits into a single patch. Without memorization it is unlikely that any classifier could accurately identify the pixelated blob as an elephant instead of, for example, a horse or a buffalo.
  • Figure 3: (a) An example of the quadtree patchification (QtP) applied to a high-resolution image. QtP uses only 25% of the original number of tokens yet retains a high degree of semantic information. Photo courtesy of the first author. (b) A schematic of a $4\times 4$ patch image being decomposed into $7$ leaf patches using a quadtree. Leaves which consist of more than a single patch are downsampled to the patch size.
  • Figure 4: The first two panels compare our MLP interpolation with bicubic interpolation. We plot $\mathcal{B}_{\text{Interp}}$ in the first panel as a measure of interpolation bias and $C_{N\rightarrow 336}^z$ in the middle panel as a measure of mesoscopic bias. The third panel shows a comparison between the $\text{[CLS]}$ tokens of various image sizes with (blue) and without (red) QtP. All data is collected and averaged over the images from the $V^*$ benchmark.
  • Figure 5: The performance on $V^*$ using re-scaled and cropped images with no quadtree selection mechanism and our MLP interpolation. The red line is with bicubic interpolation and the orange line is with bilinear interpolation. The black line represents performance of the base CLIP model with $336\times 336$ cropping. The 7B model is plotted on the left, and the 13B model on the right. We see that neither bilinear nor bicubic interpolation is suitable for extending CLIP to larger resolutions.
  • ...and 14 more figures
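
The decomposition described in the Figure 3(b) caption (a $4\times 4$ patch image reduced to $7$ leaf patches, with larger leaves downsampled to the patch size) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quadtree_leaves` and `leaf_to_patch`, the pixel-variance split criterion, the threshold value, and average pooling as the downsampling method are all assumptions chosen for illustration, since the summary above does not say which criterion or downsampling QLIP actually uses.

```python
import numpy as np


def quadtree_leaves(img, patch, thresh, y=0, x=0, size=None):
    """Return quadtree leaves as (y, x, size) tuples over a square image.

    Assumption: a region is split when its pixel variance exceeds `thresh`.
    This criterion is illustrative only.
    """
    if size is None:
        size = img.shape[0]
    region = img[y:y + size, x:x + size]
    # Stop at the base patch size, or when the region is nearly uniform.
    if size <= patch or region.var() <= thresh:
        return [(y, x, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, patch, thresh, y + dy, x + dx, half)
    return leaves


def leaf_to_patch(img, leaf, patch):
    """Average-pool a leaf region down to a (patch, patch) array."""
    y, x, size = leaf
    f = size // patch  # pooling factor; 1 for leaves already at patch size
    region = img[y:y + size, x:x + size]
    return region.reshape(patch, f, patch, f).mean(axis=(1, 3))


# A 4x4 grid of 2x2-pixel patches: only the top-left quadrant has detail.
img = np.zeros((8, 8))
img[:4, :4] = np.arange(16).reshape(4, 4)

leaves = quadtree_leaves(img, patch=2, thresh=1e-6)
tokens = np.stack([leaf_to_patch(img, leaf, 2) for leaf in leaves])
print(len(leaves), tokens.shape)  # 7 leaves instead of 16 uniform patches
```

In this toy input the three uniform quadrants each collapse to a single leaf while the detailed quadrant is split down to the base patch size, which gives the 3 + 4 = 7 leaves of the schematic. Every leaf ends up with the same patch shape, so the resulting sequence can be fed to a patch encoder that expects fixed-size inputs.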