Matryoshka Query Transformer for Large Vision-Language Models

Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang

TL;DR

This work tackles the rigidity of fixed visual token budgets in Large Vision-Language Models by introducing the Matryoshka Query Transformer (MQT), which supports elastic inference with up to $M$ tokens. By training with randomly selected $m \le M$ tokens and using a Matryoshka structure, MQT-LLaVA achieves performance on par with or better than LLaVA-1.5 using only $256$ tokens (versus $576$ for the baseline), and shows substantial TFLOPs reductions (up to $8\times$ with very small losses on some tasks). The approach reveals task-dependent token requirements, with some benchmarks maintaining robustness under token reduction while others demand more tokens for fine-grained reasoning. Overall, the paper demonstrates flexible, computation-aware deployment for LVLMs and provides a nuanced analysis of the trade-offs between accuracy and efficiency across 11 benchmarks. The results have practical implications for adapting LVLMs to diverse hardware constraints and real-time applications.

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into $m$ visual tokens during inference, where $m$ can be any number up to a predefined maximum. This is achieved by employing a query transformer with $M$ latent query tokens to compress the visual embeddings. During each training step, we randomly select $m \le M$ and train the model using only the first $m$ latent query tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once and can flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.
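
The token-selection step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the attention here is a single head with no learned projections, the uniform sampling of m over 1..M is an assumption (the abstract only says "randomly"), and the names `cross_attention`, `sample_token_budget`, and `encode_image` are hypothetical. The values 256 and 576 mirror the numbers quoted in the abstract; treating 576 as the number of patch embeddings and using a feature size of 8 are toy assumptions.

```python
import math
import random


def cross_attention(queries, patches):
    """Illustrative single-head dot-product attention with no learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * p[j] for w, p in zip(weights, patches)) / total for j in range(dim)]
        )
    return outputs


def sample_token_budget(max_tokens, rng):
    """Assumed schedule: draw m uniformly from 1..M (the source only says 'randomly')."""
    return rng.randint(1, max_tokens)


def encode_image(latent_queries, patches, m):
    """Keep the first m latent queries, discard the rest, and attend over the patches."""
    if not 1 <= m <= len(latent_queries):
        raise ValueError("m must satisfy 1 <= m <= M")
    return cross_attention(latent_queries[:m], patches)


rng = random.Random(0)
M, DIM, NUM_PATCHES = 256, 8, 576  # 256 and 576 echo the abstract; DIM is a toy value
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]
patch_embeddings = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# Training step: sample a budget, then encode with only the first m queries.
m = sample_token_budget(M, rng)
train_tokens = encode_image(latent_queries, patch_embeddings, m)

# Inference: the same set of queries serves any budget up to M.
tokens_16 = encode_image(latent_queries, patch_embeddings, 16)
tokens_256 = encode_image(latent_queries, patch_embeddings, 256)
print(len(train_tokens) == m, len(tokens_16), len(tokens_256))
```

In this toy version each query attends to the patches independently, so the 16-token output is exactly the first 16 rows of the 256-token output. Whether the trained model has this strict prefix property depends on architectural details that the abstract does not specify.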

Paper Structure

This paper contains 31 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 2: Our model employs a query transformer to encode images as visual tokens. During training, we randomly choose $m$ and keep only the first $m$ tokens; during inference, any number of tokens $m$ up to $M$ can be chosen, where $M$ is the maximum number of initialized tokens.
  • Figure 3: With only 2 visual tokens, MQT-LLaVA outperforms InstructBLIP (which uses 32 visual tokens) on all 8 benchmarks it is evaluated on.
  • Figure 4: Grad-CAM visualization of one randomly picked token when an image is encoded with 8, 16, 64, and 256 visual tokens, respectively. The model concentrates on high-level concepts when using fewer tokens and delves into low-level details when using more tokens. The complete input for the third image is "List all the objects on the desk. The objects on the desk include a computer monitor, a keyboard, a mouse, a cell phone, and a pair of headphones".
  • Figure 5: The number of visual tokens impacts different tasks differently. The x-axis is log-scaled for readability. Our model's performance on ScienceQA, MME-Cognition, and MMMU is remarkably robust to token reduction. For the full visualization of all 11 benchmarks, see the corresponding log-scaled and linear-scaled figures in the Appendix.
  • Figure 6: Examples from MME-Cognition. Grad-CAM results are from the 16-token setting, which answered all the questions correctly.
  • ...and 3 more figures