When LLaVA Meets Objects: Token Composition for Vision-Language-Models
Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
TL;DR
Mask-LLaVA tackles the high token cost of autoregressive vision-language models by introducing a tri-granularity visual representation that fuses global CLS, pooled patch, and mask-based object tokens. It trains with an oversampled token set and supports dynamic test-time token reduction without retraining, using about $75\%$ fewer tokens than the LLaVA baseline while remaining competitive across eight benchmarks. The approach uses Deformable DETR, SAM, and MaskInversion to generate object-centric tokens and applies norm scaling to harmonize the different token types before fusion through a multimodal projector to an LLM. The result is an efficient, flexible VLM that maintains quality under aggressive token pruning, which matters for deploying VLMs on resource-constrained devices.
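The token composition described above can be illustrated with a small sketch. It is not the authors' implementation: the exact form of the norm scaling is not given here, so the code assumes each CLS and object token is rescaled to the mean L2 norm of the patch tokens. The function names, the single CLS token, the token counts, and the random stand-in features are all hypothetical, and the multimodal projector and LLM are omitted.

```python
import math
import random

def l2_norm(vec):
    """L2 norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))

def mean_norm(tokens):
    """Average L2 norm over a list of token vectors."""
    return sum(l2_norm(t) for t in tokens) / len(tokens)

def rescale_to_norm(tokens, target_norm, eps=1e-8):
    """Rescale every token so its L2 norm is (approximately) target_norm."""
    return [[x * target_norm / (l2_norm(t) + eps) for x in t] for t in tokens]

def compose_tokens(cls_token, patch_tokens, object_tokens):
    """Concatenate global, patch, and object tokens after matching norms.

    Assumption: the patch tokens' mean norm is the reference scale.
    """
    target = mean_norm(patch_tokens)
    scaled_cls = rescale_to_norm([cls_token], target)
    scaled_obj = rescale_to_norm(object_tokens, target)
    return scaled_cls + patch_tokens + scaled_obj

# Random stand-ins with deliberately mismatched scales per token type.
random.seed(0)
dim = 8
cls_token = [random.gauss(0, 5.0) for _ in range(dim)]
patch_tokens = [[random.gauss(0, 1.0) for _ in range(dim)] for _ in range(16)]
object_tokens = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(4)]

visual_tokens = compose_tokens(cls_token, patch_tokens, object_tokens)
print(len(visual_tokens))  # 21 tokens: 1 CLS + 16 patch + 4 object
```

After rescaling, tokens from the three sources have comparable magnitudes, which is the stated purpose of the norm scaling step; the real system may use a different reference scale or a learned factor.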
Abstract
Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases the compute required, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly reduce the number of tokens at test time, in particular the mask-based object tokens, so that the token count can be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results that are competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with little loss in performance.
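As a rough illustration of the test-time flexibility described in the abstract, the sketch below trims only the mask-based object tokens to fit a token budget while keeping the global and patch tokens. This is a hypothetical sketch, not the paper's selection procedure: the function name, the budget parameter, the token counts, and the assumption that object tokens arrive pre-sorted by relevance (for example, detector confidence) are all illustrative.

```python
def drop_object_tokens(global_tokens, patch_tokens, object_tokens, budget):
    """Keep all global and patch tokens; trim object tokens to fit the budget.

    Assumption: object_tokens is ordered from most to least relevant,
    so truncation keeps the most relevant ones.
    """
    fixed = len(global_tokens) + len(patch_tokens)
    if budget < fixed:
        raise ValueError("budget is smaller than the global + patch token count")
    keep = min(len(object_tokens), budget - fixed)
    return global_tokens + patch_tokens + object_tokens[:keep]

# String labels stand in for token embeddings; the counts are illustrative only.
global_tokens = ["cls"]
patch_tokens = [f"patch_{i}" for i in range(16)]
object_tokens = [f"object_{i}" for i in range(10)]

full = drop_object_tokens(global_tokens, patch_tokens, object_tokens, budget=27)
reduced = drop_object_tokens(global_tokens, patch_tokens, object_tokens, budget=21)
print(len(full), len(reduced))  # 27 21
```

Because the same trained model consumes both `full` and `reduced`, the budget can be chosen per deployment or per request without retraining, which is the behaviour the abstract reports.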
