Multimodal Latent Reasoning via Hierarchical Visual Cues Injection

Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han

TL;DR

HIVE introduces a loop-transformer framework for latent-space multimodal reasoning, grounding iterative latent inference in hierarchical visual cues injected across selected ViT layers. By extending Huginn with a recurrent backbone and a structured injection schedule over layers $\mathcal{L} = \{6,12,18,24\}$, the model performs multi-step reasoning entirely within latent space, controlled by recurrence depth $r$ and adaptive compute. Training occurs in three stages with progressively richer vision-language alignment data, and a dedicated image-token scheme (<|image|> placeholder) ties visual features to the language embedding. Experimental results demonstrate that recurrence plus hierarchical cues yield substantial improvements on complex visual reasoning tasks and enable faster convergence under adaptive computation, highlighting the approach’s efficiency and robustness for real-time multimodal reasoning. These findings suggest a scalable path toward grounded, deliberative multimodal reasoning without reliance on explicit textual rationales.

Abstract

The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it grounds this process by injecting hierarchical visual cues, ranging from global scene context to fine-grained regional details, directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
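
To make the looped, cue-injected inference more concrete, the following is a minimal toy sketch and not the authors' implementation. Only the layer set $\{6,12,18,24\}$, the idea of a recurrence depth $r$, and early exit under adaptive compute come from the summary above. The step-to-layer schedule, the mixing update that stands in for the shared recurrent block, the convergence test, and all names (`layer_for_step`, `recurrent_block`, `latent_reason`) are assumptions introduced purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Layer set named in the summary; everything else below is a hypothetical toy.
INJECTION_LAYERS: Tuple[int, ...] = (6, 12, 18, 24)


def layer_for_step(step: int, layers: Sequence[int] = INJECTION_LAYERS) -> int:
    """Assumed schedule: one listed layer per step, in order, then hold the last."""
    return layers[min(step, len(layers) - 1)]


def recurrent_block(state: List[float], cue: List[float], mix: float = 0.5) -> List[float]:
    """Toy stand-in for the shared recurrent block: move the state toward the cue."""
    return [(1.0 - mix) * s + mix * c for s, c in zip(state, cue)]


def latent_reason(
    state: List[float],
    cues: Dict[int, List[float]],
    max_steps: int = 32,   # plays the role of the recurrence depth r
    tol: float = 1e-3,     # assumed exit threshold, not taken from the paper
) -> Tuple[List[float], int]:
    """Loop the block, injecting one cue per step; exit early once the state settles."""
    for step in range(max_steps):
        cue = cues[layer_for_step(step)]
        new_state = recurrent_block(state, cue)
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        # Allow early exit only after every listed layer has been injected once.
        if step + 1 >= len(INJECTION_LAYERS) and change < tol:
            return state, step + 1
    return state, max_steps


if __name__ == "__main__":
    # Made-up 2-d "features" per ViT layer, for illustration only.
    toy_cues = {6: [1.0, 0.0], 12: [0.5, 0.5], 18: [0.2, 0.8], 24: [0.0, 1.0]}
    final_state, steps_used = latent_reason([0.0, 0.0], toy_cues)
    print(steps_used, [round(x, 3) for x in final_state])
```

In this toy, the loop stops well before `max_steps` because the state settles once the last cue has been injected, which mirrors the adaptive-compute idea in spirit only; the actual exit criterion and injection order used by HIVE are not specified in this summary.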

Paper Structure

This paper contains 25 sections, 10 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Comparison of traditional MLLMs and HIVE. In traditional MLLMs, visual features extracted from a vision tower are projected into the language space and directly concatenated with text tokens; the combined sequence is then fed into a stack of transformer decoder blocks. HIVE is built upon Huginn, a recursive architecture that iteratively processes token representations through a unified set of layers to enhance feature depth. We extend it by incorporating the visual modality and, for the first time, introducing hierarchical visual information into latent-space reasoning.
  • Figure 2: Our framework incorporates a pre-trained vision encoder with a group of lightweight patch mergers that map visual features into the LLM embedding space. During multimodal alignment, the [CLS] token is removed. The three block symbols in the figure denote the Embedding, Recurrent, and Head blocks, respectively.
  • Figure 3: Building upon Huginn, we integrate a Vision Transformer (ViT) and propose a hierarchical latent-space reasoning framework. Specifically, we argue that latent-space reasoning with visual information should be hierarchical rather than merely iterative. The figure shows our comparison results on ScienceQA$_{\text{img}}$.
  • Figure 4: Detailed MMBench results. LR denotes logic reasoning, FC fine-grained perception (cross-instance), AR attribute reasoning, RR relation reasoning, FI fine-grained perception (instance-level), and CP coarse perception.
  • Figure 5: Distribution of inference steps for first-token generation across multiple-choice benchmarks. We evaluate the impact of hierarchical cue injection on the number of recurrent steps. Incorporating these cues causes a distinct leftward shift in the distribution, indicating a reduction in computation at inference time.
  • ...and 7 more figures