Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Mark Endo, Serena Yeung-Levy

TL;DR

The study systematically analyzes how downscaling LLMs impacts multimodal performance, revealing a pronounced drop in visually driven tasks and identifying perception as a critical bottleneck alongside reasoning. It introduces a decoupled perception–reasoning framework and the Extract+Think pipeline, combining Visual Extraction Tuning with step-by-step reasoning over extracted visuals to achieve high efficiency. The approach delivers strong performance with substantially smaller perception and reasoning modules and far fewer visual training samples, outperforming several baselines and setting a new standard for small-scale multimodal intelligence. This work provides both mechanistic insight into downscaling effects and practical methods to build compact, capable vision-language systems suitable for on-device or resource-constrained deployment.

Abstract

Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
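The two-stage flow described in the abstract, where a perception module first extracts instruction-relevant visual details and a reasoning module then answers from those details alone, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `ExtractThink` class, the `perceive` and `reason` callables, their signatures, and the prompt wording are all hypothetical stand-ins for whatever VLM and LLM are actually used.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; the source does not specify these signatures.
# perceive(image_bytes, instruction) -> text describing relevant visual details
# reason(prompt) -> answer text
PerceptionFn = Callable[[bytes, str], str]
ReasoningFn = Callable[[str], str]


@dataclass
class ExtractThink:
    """Illustrative two-stage pipeline: extract visual details, then reason."""

    perceive: PerceptionFn
    reason: ReasoningFn

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1 (perception): the VLM sees the image and the instruction
        # and returns the visual details relevant to that instruction.
        details = self.perceive(image, question)

        # Stage 2 (reasoning): the LLM never sees the image, only the
        # extracted details plus the question. Prompt wording is assumed.
        prompt = (
            f"Visual details:\n{details}\n\n"
            f"Question: {question}\n"
            "Think step by step, then give the final answer."
        )
        return self.reason(prompt)
```

Keeping the two stages behind separate callables mirrors the decoupled analysis in the paper: either module can be swapped for a smaller or larger model without touching the other.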

Paper Structure

This paper contains 14 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Overview. (1) We first analyze how downscaling language model size affects multimodal performance, finding that tasks which rely more heavily on the base LLM (e.g., general or knowledge tasks) are largely unaffected, whereas visually-demanding tasks show a disproportionate drop. (2) To uncover the mechanisms underlying the deteriorating visual capabilities under LLM downscaling, we perform a decoupled analysis of perception and reasoning, revealing that perception (alongside reasoning) is a critical bottleneck in small multimodal models. (3) To address these limitations, we present a two-stage perception–reasoning framework, featuring visual extraction tuning--which trains the model to extract instruction-relevant visual details consistently across tasks--coupled with step-by-step reasoning about the extracted visual details.
  • Figure 2: LLM downscaling exploration. (Left) The performance drop-off from LLM downscaling is most notable for visually demanding tasks. Tasks such as Grounding and Perceptual Similarity (e.g., NIGHTS and PieAPP), which primarily focus on visual processing, are affected most by LLM downscaling, whereas tasks that rely heavily on the base LLM (such as ScienceQA, which evaluates knowledge, or GQA, which assesses general abilities) are affected less. (Right) The more a task's performance declines under LLM downscaling, the more it depends on visual information. As the impact of LLM downscaling increases (8B $\rightarrow$ 0.6B), so does the task’s reliance on visual information (measured by the performance difference with and without visual input). IEI=Image Edit Instruction, VST=Visual Story Telling, Spot-Diff=Spot the Difference, TR-VQA=Text-Rich VQA, MI-VQA=Multi-Image-VQA. Full plots for all datasets are provided in the supplemental material.
  • Figure 3: Decoupled perception and reasoning downscaling analysis. (a) Decoupled Setup. We disentangle perceptual and reasoning abilities using a two-stage framework: the perception module (VLM) first extracts visually relevant information, then the reasoning module (LLM) generates answers based on the extracted visual information. (b) Perception and reasoning emerge as key bottlenecks under LLM downscaling. We see that LLM downscaling of either the perception module or reasoning module largely degrades in-domain and out-of-domain task performance. (c) Perceptual degradation limits performance across tasks. Even for tasks targeting visual reasoning (e.g., IR and LR), downscaling perception has an impact comparable to--or even exceeding--that of downscaling reasoning. In this per-task analysis, the non-downscaled module is set at 8B. CP=Coarse Perception, FP=Fine-grained Perception, IR=Instance Reasoning, LR=Logical Reasoning, ST=Science & Technology.
  • Figure 4: Captioning alleviates perception bottleneck. Decoupled frameworks use an 8B reasoning module.
  • Figure 5: Visual extraction tuning. (Top) Simple pipeline for generating visual extraction tuning data. Given a visual instruction tuning example, it is converted to a visual extraction task by prompting a VLM to describe fine-grained visual details relevant to the original question. (Bottom) Visual extraction tuning enhances perception. Post-training on visual extraction data improves both in-domain and out-of-domain (MMStar) performance. Size indicates the number of parameters of the perception module's LLM. All setups use an 8B reasoning module.
  • ...and 9 more figures
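
The two quantities compared in the Figure 2 caption, a task's drop under LLM downscaling and its reliance on visual information (the performance difference with and without visual input), reduce to simple differences. The sketch below shows one way to compute and rank them. The function names, the dictionary layout, and the choice to measure visual reliance on the larger model are assumptions made for illustration, and the scores are placeholder values, not results from the paper.

```python
from typing import Dict, List, Tuple


def visual_reliance(acc_with_image: float, acc_without_image: float) -> float:
    """Performance difference with and without visual input."""
    return acc_with_image - acc_without_image


def downscaling_drop(acc_large: float, acc_small: float) -> float:
    """Performance lost when moving from the larger to the smaller LLM."""
    return acc_large - acc_small


def rank_tasks(
    scores: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float, float]]:
    """Return (task, drop, reliance) rows, largest downscaling drop first."""
    rows = [
        (
            task,
            downscaling_drop(s["large"], s["small"]),
            # Assumption: reliance is measured on the larger model.
            visual_reliance(s["large"], s["large_no_image"]),
        )
        for task, s in scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder numbers for illustration only; not taken from the paper.
placeholder_scores = {
    "task_a": {"large": 80.0, "small": 50.0, "large_no_image": 40.0},
    "task_b": {"large": 70.0, "small": 65.0, "large_no_image": 60.0},
}

for task, drop, reliance in rank_tasks(placeholder_scores):
    print(f"{task}: drop={drop:.1f}, visual reliance={reliance:.1f}")
```

With real per-task accuracies in place of the placeholders, a scatter of drop against reliance would correspond to the right-hand panel the caption describes.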