CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan

TL;DR

CoCoVa introduces a continuous latent-space reasoning paradigm for vision–language understanding to overcome the discrete token bottleneck in traditional VLMs. It replaces one-pass, token-based reasoning with an iterative cycle that refines a chain of latent thoughts $Z = \{z_1, \dots, z_K\}$ via an LQ-Former, aided by dynamic visual token selection and a multi-task objective that includes symmetric InfoNCE, diffusion-based latent reconstruction, and latent-language modeling. The approach yields superior accuracy and token efficiency on multiple benchmarks, with qualitative analyses showing interpretable, structured latent trajectories and verifiable image reconstructions from the latent thoughts. By scaling across 1.5B to 7B LLM backbones, CoCoVa demonstrates that continuous cross-modal reasoning can rival larger discrete approaches while offering better efficiency and robustness, suggesting a scalable path toward more human-like multimodal intelligence.
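The symmetric InfoNCE term named above can be illustrated with a minimal, standard-library sketch. It is not the paper's implementation: the function name `symmetric_info_nce`, the temperature of 0.07, and the choice of which embeddings are paired with the latent thoughts are all assumptions made for illustration. The sketch only shows the generic form of the loss, in which matching pairs sit on the diagonal of a similarity matrix and the cross-entropy is averaged over both matching directions.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(latents, targets, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings.

    latents[i] and targets[i] form the positive pair; every other
    combination in the batch is treated as a negative. The temperature
    value is an illustrative assumption, not taken from the paper.
    """
    if not latents or len(latents) != len(targets):
        raise ValueError("need two non-empty batches of equal size")
    a = [_normalize(v) for v in latents]
    b = [_normalize(v) for v in targets]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction 1: each latent must pick out its own target (rows).
    loss_ab = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction 2: each target must pick out its own latent (columns).
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(_log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)
```

With well-aligned pairs the loss approaches zero, and it grows when the pairing is shuffled, which is the property that pushes the latent representations toward their paired embeddings.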

Abstract

In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show that CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
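The control flow of the iterative reasoning cycle can be sketched as a loop that produces one latent thought per step and collects the thoughts into a chain. This is a toy illustration under stated assumptions: `mean_fuse` is a hypothetical placeholder that simply averages its inputs and stands in for the LQ-Former, whose internals are not described in this section, and feeding each new thought back as the next step's state is likewise an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_fuse(visual_tokens: Sequence[Vector], state: Vector) -> Vector:
    """Hypothetical stand-in for the LQ-Former: element-wise mean of the
    selected visual tokens and the current state. NOT the real module."""
    rows = [list(t) for t in visual_tokens] + [list(state)]
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(state))]


def latent_reasoning_chain(
    visual_tokens: Sequence[Vector],
    hidden_state: Vector,
    num_steps: int,
    fuse: Callable[[Sequence[Vector], Vector], Vector] = mean_fuse,
) -> List[Vector]:
    """Run num_steps fusion steps and return the chain of latent thoughts."""
    chain: List[Vector] = []
    state = list(hidden_state)
    for _ in range(num_steps):
        z = fuse(visual_tokens, state)  # one latent thought per step
        chain.append(z)
        state = z  # assumption: the new thought conditions the next step
    return chain
```

Calling `latent_reasoning_chain([[1.0, 1.0]], [0.0, 0.0], num_steps=3)` returns three vectors that move progressively toward the visual token, which mirrors, in a very reduced form, the idea of refining a representation over several steps rather than in one pass.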

Paper Structure

This paper contains 60 sections, 14 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: Illustrative comparison of interpretability across humans, vanilla vision–language models (VLMs), and our proposed CoCoVa framework. While humans perceive nuanced affective meanings and vanilla VLMs give shallow literal captions, CoCoVa leverages latent reasoning to produce richer, more context-aware interpretations. Using a diffusion-based reconstructor and an LLM backbone, it yields both coarse visual reconstructions and richer textual interpretations.
  • Figure 2: Overview of CoCoVa, which contains three core modules: (I) Token Selection: the model aggregates LLM attention and applies a $w\times w$ sliding window to select the most salient visual tokens; (II) Multimodal Latent Fusion: the LQ-Former takes the selected visual tokens and the LLM's last hidden states as inputs, iteratively generating latent thoughts that integrate visual and linguistic information over $K$ reasoning steps; (III) Multi-Task Representation Learning: the model learns a unified latent representation that consistently aligns visual and linguistic information across multiple tasks, encouraging vision-language models to form an interpretable and transferable multimodal reasoning capability.
  • Figure 3: Dynamic visual token selection identifies the most relevant image region based on the attention map.
  • Figure 4: CoCoVa four-stage training pipeline. Flame and snowflake icons indicate modules whose parameters are updated and frozen, respectively.
  • Figure 5: Impact of varying the number of reasoning steps on model performance and output efficiency. Performance improves with additional steps initially before saturating, while output length decreases and stabilizes, indicating an optimal balance between reasoning depth and computational efficiency.
  • ...and 10 more figures
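
The token selection step described in the Figure 2 caption (aggregate attention, then apply a $w\times w$ sliding window) can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes the attention has already been aggregated into one score per visual token on an H×W grid, and it assumes that "selecting" means keeping the single window with the highest summed attention. The function name `select_salient_window` is hypothetical.

```python
def select_salient_window(attn, w):
    """Return flat indices of the visual tokens inside the w x w window
    with the largest total attention.

    attn: H x W nested list of aggregated attention scores (assumed input).
    w:    window side length, 1 <= w <= min(H, W).
    Picking exactly one best window is an illustrative assumption.
    """
    height, width = len(attn), len(attn[0])
    if not 1 <= w <= min(height, width):
        raise ValueError("window size must fit inside the attention grid")

    best_score, best_pos = float("-inf"), (0, 0)
    for r in range(height - w + 1):
        for c in range(width - w + 1):
            # Total attention mass covered by the window anchored at (r, c).
            score = sum(attn[r + dr][c + dc] for dr in range(w) for dc in range(w))
            if score > best_score:
                best_score, best_pos = score, (r, c)

    r0, c0 = best_pos
    # Row-major flat indices, matching a flattened H*W visual token sequence.
    return [(r0 + dr) * width + (c0 + dc) for dr in range(w) for dc in range(w)]


attention_grid = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.2],
    [0.0, 0.1, 0.2, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
selected = select_salient_window(attention_grid, w=2)  # [6, 7, 10, 11]
```

The brute-force scan costs O(H·W·w²); a summed-area table would reduce this, but for the small token grids typical of vision encoders the simple form is easier to read.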