Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan, Zhihao Dou, Shuo Li

Abstract

Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets because of the large number of visual tokens they process. Previous methods reduce the number of visual tokens either before or within the LLM, but these strategies inevitably lose visual semantics. To address this issue, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy that boosts the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of the reduced visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequency components are then modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This preserves the visual semantics dominated by a few visual tokens and restores diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, which allows the number of visual tokens to be adjusted elastically during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be made publicly available.
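
The abstract names the ingredients of FMVR (AvgPool, MaxPool, lightweight learnable parameters) but does not give the formula. The sketch below is one possible reading, not the authors' implementation. It assumes that the saliency term is the residual of a token after average pooling, that the anti-saliency term is the gap between a token and its local maximum, and that two scalars `alpha` and `beta` (hypothetical names) scale these terms before they are added back. Tokens are scalars here for brevity; real visual tokens are vectors and would be pooled per channel.

```python
def pool1d(x, k, reduce_fn):
    """Stride-1 pooling with a centered window of odd size k, clipped at the edges."""
    r = k // 2
    return [reduce_fn(x[max(0, i - r): i + r + 1]) for i in range(len(x))]


def mean(window):
    return sum(window) / len(window)


def modulate(x, alpha, beta, k=3):
    """Assumed form: x + alpha * (x - AvgPool(x)) + beta * (MaxPool(x) - x).

    alpha and beta stand in for the learnable parameters; the exact
    combination rule is an assumption, not taken from the paper.
    """
    avg = pool1d(x, k, mean)
    mx = pool1d(x, k, max)
    # Residual after smoothing: large where a token stands out from its neighbours.
    saliency = [xi - a for xi, a in zip(x, avg)]
    # Gap to the local maximum: large where a token is weak next to a strong one.
    anti_saliency = [m - xi for xi, m in zip(x, mx)]
    return [xi + alpha * s + beta * w
            for xi, s, w in zip(x, saliency, anti_saliency)]


tokens = [0.0, 4.0, 0.0, 0.0]
unchanged = modulate(tokens, alpha=0.0, beta=0.0)
lifted = modulate(tokens, alpha=0.0, beta=0.5)
sharpened = modulate(tokens, alpha=1.0, beta=0.0)
```

Under these assumptions, setting both scalars to zero returns the input unchanged, a positive `beta` lifts the weak tokens that sit next to the dominant one, and a positive `alpha` amplifies the dominant token itself.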

Paper Structure (27 sections, 7 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Our FMVR (b) restores visual semantics from compressed tokens, alleviating the loss of visual content in previous token compression methods (a).
  • Figure 2: Grad-CAM visualization (576 and 36 visual tokens) shows that the reduction of visual tokens leads to a noticeable degradation in visual focus.
  • Figure 3: Illustration of FMVR-LLaVA. The FMVR is injected into MRL to construct nested visual tokens, where FMVR is used to enhance the visual semantics of each visual token set, thus forming a set of reinforced nested visual tokens for LLM training.
  • Figure 4: Comparison under different numbers of visual tokens. Our method achieves higher accuracy than M3 and MQT-LLaVA.
  • Figure 5: Grad-CAM visualization and response comparison without and with FMVR under 36 visual tokens. The Grad-CAM visualization shows that reducing the number of visual tokens noticeably degrades visual semantics, which leads to hallucinated objects in the response.
  • ...and 2 more figures
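
The captions above refer to nested visual token sets and to settings with 576 and 36 visual tokens. As a rough illustration of how coarse-to-fine sets of those sizes can be derived, the sketch below applies repeated 2x2 average pooling to a 24x24 token grid. The pooling operator, the intermediate size of 144, and the function names are assumptions made for this example; the excerpt does not state how the sets are built.

```python
def avg_pool2x2(grid):
    """Non-overlapping 2x2 average pooling over a square grid with even side length."""
    n = len(grid)
    return [
        [(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
         for j in range(0, n, 2)]
        for i in range(0, n, 2)
    ]


def nested_token_grids(grid, levels):
    """Return [grid, pooled once, pooled twice, ...], finest first."""
    grids = [grid]
    for _ in range(levels):
        grids.append(avg_pool2x2(grids[-1]))
    return grids


# Scalar stand-ins for token features on a 24x24 grid (576 tokens).
base = [[float(r * 24 + c) for c in range(24)] for r in range(24)]
grids = nested_token_grids(base, levels=2)
counts = [len(g) * len(g[0]) for g in grids]  # 576, 144, 36 tokens
```

Each coarser grid is computed from the finer one, so the sets are nested in the sense that a coarse token summarizes a fixed block of finer tokens. At inference, choosing which grid to feed the LLM is what trades accuracy for compute.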