SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti

TL;DR

SPARC introduces a brain-inspired framework that decouples perception from reasoning in vision-language models, enabling test-time scaling with asymmetric compute allocation and modular optimization. A two-stage pipeline first performs implicit relevance detection to localize task-relevant image regions, then reasons over high-resolution crops of those regions to answer the question, reducing context entanglement and token costs. The approach yields strong, training-free performance gains across $V^*$, HRBench, and out-of-distribution (OOD) remote sensing benchmarks, improves perceptual localization via weighted boxes fusion (WBF) and self-consistency, and shows that targeted LoRA fine-tuning of the perception stage can further boost results with minimal risk to reasoning. Overall, SPARC achieves competitive accuracy with an order-of-magnitude reduction in visual tokens and opens avenues for scalable, robust multimodal reasoning in dynamic inference settings.
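The summary above does not spell out how WBF and self-consistency are combined. One plausible reading is that several box proposals are sampled for the same question and overlapping proposals are merged into a single consensus crop. The sketch below is a simplified, single-pass variant of weighted boxes fusion written for illustration only: the function names (`iou`, `fuse_boxes`), the `iou_thr` parameter, and the use of cluster size as a vote count are all assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def _weighted_mean(members: Sequence[Tuple[Box, float]]) -> Box:
    """Score-weighted average of box coordinates (scores assumed positive)."""
    total = sum(score for _, score in members)
    return tuple(sum(box[k] * score for box, score in members) / total for k in range(4))


def fuse_boxes(
    boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5
) -> List[Tuple[Box, int]]:
    """Greedily cluster proposals by IoU and fuse each cluster.

    Returns (fused_box, votes) pairs, most-voted first. `votes` is the number
    of sampled proposals that agreed on the region (hypothetical
    self-consistency signal).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    clusters: List[dict] = []
    for i in order:
        for cluster in clusters:
            if iou(cluster["fused"], boxes[i]) >= iou_thr:
                cluster["members"].append((boxes[i], scores[i]))
                cluster["fused"] = _weighted_mean(cluster["members"])
                break
        else:  # no existing cluster overlaps enough: start a new one
            clusters.append({"members": [(boxes[i], scores[i])], "fused": boxes[i]})
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return [(c["fused"], len(c["members"])) for c in clusters]
```

In this sketch, a region proposed by many samples ends up with a high vote count and a position averaged over all of them, while an isolated proposal stays as its own low-vote box and can be discarded or kept depending on the budget.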

Abstract

Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, achieving good performance requires expensive reinforcement learning with hand-crafted rewards. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline in which the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task while using a 200$\times$ smaller token budget.
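The two-stage control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the image is a toy list of pixel rows, and `perceive` and `reason` are hypothetical callables standing in for the two VLM calls (localization on a low-resolution view, then reasoning over full-resolution crops). The names `downsample`, `crop`, `sparc_answer`, and `search_stride` are likewise invented for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1) in [0, 1]
Image = List[List[int]]                  # toy grayscale image: rows of pixels


def downsample(image: Image, stride: int) -> Image:
    """Cheap low-resolution view used only for the global search stage."""
    return [row[::stride] for row in image[::stride]]


def crop(image: Image, box: Box) -> Image:
    """Cut a normalized box out of the full-resolution image."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    left, top = int(x0 * width), int(y0 * height)
    right, bottom = math.ceil(x1 * width), math.ceil(y1 * height)
    return [row[left:right] for row in image[top:bottom]]


def sparc_answer(
    image: Image,
    question: str,
    perceive: Callable[[Image, str], Sequence[Box]],  # stage 1 (hypothetical)
    reason: Callable[[List[Image], str], str],        # stage 2 (hypothetical)
    search_stride: int = 4,
) -> str:
    """Stage 1 localizes on a low-res view; stage 2 reasons over high-res crops."""
    low_res = downsample(image, search_stride)
    boxes = perceive(low_res, question)
    # Boxes are normalized, so they transfer from the low-res view
    # back to the original image without rescaling.
    crops = [crop(image, box) for box in boxes]
    return reason(crops, question)
```

Because the two stages only communicate through box coordinates, either callable can be swapped, resampled, or fine-tuned without touching the other, which is the property the abstract relies on for asymmetric compute allocation.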

Paper Structure (23 sections, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Overview of the SPARC framework. We decouple the VLM inference process into two distinct functional circuits. Stage 1 (Perception): The What and Where Circuits perform Implicit Relevance Detection (IRD), taking the image and question as input to output relevant crop coordinates (e.g., localizing the woman's ear). Stage 2 (Reasoning): The "Prefrontal Cortex Circuit" synthesizes a CoT by reasoning over the high-resolution crops identified in the first stage and outputs the final answer ("blue"). This separation enables independent optimization and robust, efficient test-time scaling.
  • Figure 2: The plot shows downstream reasoning accuracy against the crop overlap ratio. While performance generally degrades as overlap decreases, this effect is most pronounced for lower resolutions. Crucially, at high overlap ratios, the 256px model converges to the performance of the full-resolution model. This demonstrates that accurate perceptual guidance can fully compensate for the loss of global visual detail, allowing for highly efficient inference.
  • Figure 3: SPARC outperforms the "thinking with images" paradigm of Qwen3VL-4B, providing a more robust and efficient inference paradigm. This advantage is particularly pronounced in perceptually demanding scenarios, where SPARC achieves superior localization and reasoning with significantly fewer tokens.
  • Figure 4: We extend our analysis to the Molmo2 architecture, plotting accuracy against crop overlap ratio. Consistent with our findings on Qwen3VL, the low-resolution variants exhibit a steep performance recovery as crop precision improves. Notably, high-quality crops allow the efficient 256px and 512px models to approach the performance upper bound of the Full-resolution baseline, further supporting the motivation of the SPARC pipeline.
  • Figure 5: We measure reasoning accuracy as a function of crop expansion factor (up to $100\times$ the original box area). While moderate expansion (scales of $2\times$ to $4\times$) improves performance by providing necessary context, excessive scaling leads to a sharp decline for resolution-constrained models.
  • ...and 4 more figures
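
The two quantities plotted in Figures 2, 4, and 5 can be made concrete with a short sketch. The captions do not define them precisely, so the definitions here are assumptions: `overlap_ratio` is taken to be the fraction of the target box's area covered by the crop, and `expand_box` scales the box area by a given factor around its center (side lengths grow by the square root of the factor), clipping to the image bounds. Both function names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # pixel coordinates (x0, y0, x1, y1)


def expand_box(box: Box, area_factor: float, width: float, height: float) -> Box:
    """Grow a box's area by `area_factor` around its center, clipped to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side_scale = math.sqrt(area_factor)  # area scales with the square of side length
    half_w = (x1 - x0) * side_scale / 2
    half_h = (y1 - y0) * side_scale / 2
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(width), cx + half_w),
        min(float(height), cy + half_h),
    )


def overlap_ratio(crop: Box, target: Box) -> float:
    """Fraction of the target box covered by the crop (assumed definition)."""
    ix0, iy0 = max(crop[0], target[0]), max(crop[1], target[1])
    ix1, iy1 = min(crop[2], target[2]), min(crop[3], target[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    return inter / target_area if target_area > 0 else 0.0
```

Under these definitions, a $4\times$ expansion of a 20 by 20 box yields a 40 by 40 crop, while a $100\times$ expansion in a 100 by 100 image clips to the whole image. The latter is the regime where, per Figure 5, a resolution-constrained model loses the detail the crop was meant to preserve.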