When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Soumya Jahagirdar; Walid Bousselham; Anna Kukleva; Hilde Kuehne

TL;DR

Mask-LLaVA tackles the high token cost of autoregressive vision-language models by introducing a tri-granularity visual representation that fuses a global [CLS] token, pooled patch tokens, and mask-based object tokens. It trains with an oversampled token set and supports dynamic, test-time token reduction without retraining, using about 75% fewer tokens than the LLaVA baseline while remaining competitive across eight benchmarks. The approach uses Deformable DETR, SAM, and MaskInversion to generate object-centric tokens, and applies norm scaling to harmonize the different token types before they are passed through a multimodal projector to an LLM. The result is an efficient, flexible VLM that maintains quality under aggressive token pruning, which makes it relevant for deploying VLMs on resource-constrained devices.

Abstract

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases the compute cost, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens at test time, in particular mask-based object tokens. The number of tokens can therefore be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results that are competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
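As a rough illustration of the composition described in the abstract, the sketch below concatenates one [CLS] token, pooled patch tokens, and mask tokens, rescales token norms, and drops mask tokens at test time. It is not the authors' implementation: the function names, the choice to rescale to the mean patch-token norm, the keep-the-first-k selection rule, and all sizes are assumptions made for illustration only.

```python
import numpy as np

def scale_to_norm(tokens, target_norm):
    """Rescale each token (row) so that its L2 norm equals target_norm."""
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    return tokens * (target_norm / np.maximum(norms, 1e-8))

def compose_tokens(cls_tok, patch_toks, mask_toks, num_mask_keep=None):
    """Concatenate [CLS], pooled patch, and mask tokens into one sequence.

    num_mask_keep is a hypothetical test-time budget for mask tokens;
    None keeps all of them (the training-time setting).
    """
    if num_mask_keep is not None:
        # Assumed selection rule: keep the first k mask tokens.
        mask_toks = mask_toks[:num_mask_keep]
    # Assumed scaling rule: match the mean L2 norm of the patch tokens.
    target = np.linalg.norm(patch_toks, axis=-1).mean()
    cls_scaled = scale_to_norm(cls_tok[None, :], target)
    mask_scaled = scale_to_norm(mask_toks, target)
    return np.concatenate([cls_scaled, patch_toks, mask_scaled], axis=0)

rng = np.random.default_rng(0)
dim = 16                                       # toy feature dimension
cls_tok = rng.normal(size=dim)                 # one global token
patch_toks = rng.normal(size=(36, dim))        # e.g. a 6x6 pooled grid
mask_toks = 5.0 * rng.normal(size=(20, dim))   # object tokens, different scale

train_seq = compose_tokens(cls_tok, patch_toks, mask_toks)
test_seq = compose_tokens(cls_tok, patch_toks, mask_toks, num_mask_keep=5)
print(train_seq.shape, test_seq.shape)  # (57, 16) (42, 16)
```

The same function serves both settings, which mirrors the idea that the token budget is chosen at inference time rather than fixed by training.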

Paper Structure (38 sections, 4 equations, 7 figures, 10 tables)

Introduction
Related works
Multimodal Vision Language Models
Token Pruning for VLMs
Token Compression for VLMs
Mask-LLaVA
Visual Token Composition
Mask Token Computation
Scaling
Architecture and Training
Training
Token reduction
Mask Token Pruning
Patch Token Pruning and pooling
Evaluation
...and 23 more sections

Figures (7)

Figure 1: Overview of the Mask-LLaVA architecture. Given an input image, the local feature extraction module pools patch tokens from the Vision Transformer (ViT) [learning_transferable_vis_icml_2021] according to their 2D grid structure to obtain local context features. Simultaneously, SAM [sam_iccv_2023] generates masks, which are used in the object feature extraction module together with the [CLS] token from the ViT to obtain mask-based object representations. The explainability map of the [CLS] token is guided to focus on the corresponding masked regions. Finally, the [CLS] token, pooled local features, and object features are projected and passed to the LLM along with a question to generate a response.
Figure 2: Mask-token computation. This figure illustrates how segmentation masks are obtained. First, an objectness detector [deformable_detr_2020] identifies bounding boxes in the image. These bounding boxes, along with the image, are then passed to the SAM [sam_iccv_2023] model to generate segmentation masks. A background mask is also included. The resulting masks are then used to extract mask-based object features.
Figure 3: Qualitative results on the POPE [pope] dataset.
Figure 4: Qualitative results on the GQA [gqa] dataset.
Figure 5: Performance comparison of LLaVA-1.5-7b [visual_intruction_tuning_2023], FastV [fitprune_2024], FitPrune [fitprune_2024], and Mask-LLaVA (ours). We compare Mask-LLaVA with LLaVA at different token reduction rates during inference, and with the FastV and FitPrune methods. Mask-LLaVA performs strongly even with a small number of tokens, and its performance improves consistently as the number of tokens increases. The comparison covers four datasets.
...and 2 more figures
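
Two steps named in the captions of Figures 1 and 2 can be sketched in a few lines: pooling patch tokens over their 2D grid, and adding a background mask to a set of object masks. This is a minimal sketch under stated assumptions, not the paper's code. The function names, the use of non-overlapping average pooling, and the definition of the background as every pixel not covered by any object mask are all hypothetical choices; the captions do not specify them.

```python
import numpy as np

def grid_pool(patch_tokens, grid_size, window):
    """Average-pool a flattened (grid_size * grid_size, dim) patch sequence
    over non-overlapping window x window blocks of the 2D grid."""
    assert grid_size % window == 0, "window must divide grid_size"
    dim = patch_tokens.shape[-1]
    out = grid_size // window
    # Row-major layout: token index = row * grid_size + col.
    blocks = patch_tokens.reshape(out, window, out, window, dim)
    return blocks.mean(axis=(1, 3)).reshape(out * out, dim)

def add_background_mask(object_masks):
    """Append one mask covering every pixel that no object mask covers
    (an assumed definition of the background mask)."""
    background = ~np.any(object_masks, axis=0)
    return np.concatenate([object_masks, background[None]], axis=0)

# Toy 4x4 grid of 1-dimensional tokens, pooled with 2x2 windows.
tokens = np.arange(16, dtype=float).reshape(16, 1)
pooled = grid_pool(tokens, grid_size=4, window=2)

# Two toy object masks on a 4x4 image, plus the derived background.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True
masks[1, 2:, 2:] = True
all_masks = add_background_mask(masks)

print(pooled.ravel())    # block means: 2.5, 4.5, 10.5, 12.5
print(all_masks.shape)   # (3, 4, 4)
```

With a 24x24 patch grid (an assumed size) and `window=4`, the same call would reduce 576 patch tokens to 36.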
