Matryoshka Multimodal Models

Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

TL;DR

Matryoshka Multimodal Models (M$^3$) address the inefficiency of fixed, dense visual token prefixes in large multimodal models by learning nested, coarse-to-fine visual token sets within a single weight family. By training across multiple token scales derived from hierarchical pooling of CLIP-based visual features, M$^3$ enables explicit inference-time control of visual granularity without adding parameters. Empirically, many benchmarks achieve near full-token performance with as few as ~9 tokens per image, while zero-shot tests indicate longer visual sequences can generalize with compact representations; a notable gap between oracle and actual performance highlights room for a token-scale predictor. The work also offers a framework to analyze dataset visual richness and paves the way for adaptive token-length strategies in vision-language reasoning, with potential extensions to other modalities and longer-context settings.
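The hierarchical pooling mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the visual encoder yields a 24×24 grid of 576 feature vectors, and that each coarser level is obtained by non-overlapping average pooling of the level below it. The pooling schedule `(2, 2, 2, 3)`, the feature width of 8, and the helper names `avg_pool_grid` and `nested_token_sets` are hypothetical choices made for this example.

```python
import numpy as np

def avg_pool_grid(tokens, k):
    """Average-pool an (H, W, D) token grid with non-overlapping k x k windows."""
    h, w, d = tokens.shape
    assert h % k == 0 and w % k == 0, "grid side must be divisible by k"
    # Split each spatial axis into (blocks, k) and average inside every block.
    return tokens.reshape(h // k, k, w // k, k, d).mean(axis=(1, 3))

def nested_token_sets(grid, factors=(2, 2, 2, 3)):
    """Return token sequences from coarsest to finest.

    Each coarser grid is pooled from the next finer one, so the sets are nested
    in the sense that coarse tokens are fully determined by fine tokens.
    The pooling factors are an assumption made for this sketch.
    """
    grids = [grid]
    for k in factors:
        grids.append(avg_pool_grid(grids[-1], k))
    # Flatten every (H, W, D) grid into an (H*W, D) token sequence.
    return [g.reshape(-1, g.shape[-1]) for g in reversed(grids)]

# Stand-in for encoder features: a 24 x 24 grid with a toy feature width of 8.
rng = np.random.default_rng(0)
grid = rng.standard_normal((24, 24, 8))
scales = nested_token_sets(grid)
print([s.shape[0] for s in scales])  # [1, 9, 36, 144, 576]
```

At inference time, a caller would pick one of these sequences as the visual prefix, trading token count for detail. No extra parameters are involved, because pooling is a fixed operation.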

Abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed, large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design produces an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning and merging methods exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka dolls, we propose M$^3$: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M$^3$ provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks need only around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
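To make the gap between the oracle upper bound and fixed-scale performance concrete, here is a small sketch. It rests on an assumption not spelled out in the abstract: that the oracle picks, for each test sample, the smallest token scale at which the model answers correctly, and counts a sample as failed only if no scale works. The function name `oracle_summary` and the toy correctness table are hypothetical and made up purely for illustration; they are not results from the paper.

```python
def oracle_summary(correct_by_scale):
    """Compare fixed-scale accuracy with a per-sample oracle.

    correct_by_scale maps a token count to a list of booleans, one per sample,
    saying whether the model was correct when given that many visual tokens.
    """
    scales = sorted(correct_by_scale)
    n = len(correct_by_scale[scales[0]])
    # Accuracy when every sample is forced to use the same scale.
    fixed_acc = {s: sum(correct_by_scale[s]) / n for s in scales}
    # Oracle (assumed definition): smallest scale that is correct, else None.
    chosen = []
    for i in range(n):
        ok = [s for s in scales if correct_by_scale[s][i]]
        chosen.append(ok[0] if ok else None)
    solved = [c for c in chosen if c is not None]
    oracle_acc = len(solved) / n
    avg_tokens = sum(solved) / len(solved) if solved else float("nan")
    return fixed_acc, oracle_acc, avg_tokens

# Made-up toy data: 4 samples evaluated at 3 scales (not from the paper).
toy = {
    1:   [True,  False, False, False],
    9:   [True,  True,  False, False],
    576: [False, True,  True,  False],
}
fixed_acc, oracle_acc, avg_tokens = oracle_summary(toy)
print(fixed_acc)   # {1: 0.25, 9: 0.5, 576: 0.5}
print(oracle_acc)  # 0.75
```

In this toy table the oracle beats every fixed scale because different samples are solved at different scales. A token-scale predictor would try to close that gap by choosing the scale per sample.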

Paper Structure (24 sections, 2 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Matryoshka Multimodal Models. We constrain the coarser set of visual tokens $\mathbf{X}_{S_{i-1}}$ to be derived from the finer set of visual tokens $\mathbf{X}_{S_i}$. As a result, the granularity of the Matryoshka visual tokens changes gradually and controllably. The image is from the MSCOCO (Lin et al., 2014) validation set.
  • Figure 2: MMBench evaluation results for M$^3$, the oracle under LLaVA-1.5-M$^3$, LLaVA-1.5 with average pooling at inference time, LLaVA-1.5 trained separately for each specific scale, and other methods. M$^3$ performs at least as well as LLaVA trained for each specific scale. A large gap exists between the oracle upper bound and the model's actual performance at a specific scale.
  • Figure 3: Architecture of our proposed Matryoshka Multimodal Models. The visual features from CLIP are represented as several groups of coarse-to-fine visual tokens. At test time, users can explicitly control the granularity of the visual features.
  • Figure 4: TextVQA test samples with correct and incorrect predictions at different scales. Answers vary with the number of visual tokens. In addition, M$^3$ can serve as a framework to evaluate the complexity of images.
  • Figure 5: Visualization of sequential and spatial sampling. Given a $24\times24$ grid, the visualized cells denote the sampled tokens.
  • ...and 1 more figures
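
The two sampling schemes named in the Figure 5 caption can be illustrated with boolean masks over the $24\times24$ grid. The caption does not define them, so this sketch assumes that sequential sampling keeps the first $n$ tokens in raster order and that spatial sampling keeps tokens on a uniform strided lattice. The helper names `sequential_mask` and `spatial_mask` are hypothetical.

```python
import numpy as np

def sequential_mask(side, n):
    """Mark the first n cells of a side x side grid in raster (row-major) order."""
    mask = np.zeros(side * side, dtype=bool)
    mask[:n] = True
    return mask.reshape(side, side)

def spatial_mask(side, per_side):
    """Mark a uniform per_side x per_side lattice of cells on a side x side grid."""
    assert side % per_side == 0, "side must be divisible by per_side"
    stride = side // per_side
    mask = np.zeros((side, side), dtype=bool)
    # Start at the centre of each stride-sized block so samples spread evenly.
    mask[stride // 2::stride, stride // 2::stride] = True
    return mask

seq = sequential_mask(24, 9)  # 9 tokens, all in the first row
spa = spatial_mask(24, 3)     # 9 tokens, spread over the whole image
print(int(seq.sum()), int(spa.sum()))  # 9 9
print(np.argwhere(spa)[:3].tolist())   # [[4, 4], [4, 12], [4, 20]]
```

With the same budget of 9 tokens, the sequential mask covers only part of the top row, while the spatial mask touches every region of the image.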