FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang

TL;DR

FUSION targets deep cross-modal understanding by integrating vision and language at every stage of the processing pipeline rather than only during LLM decoding. It introduces Text-Guided Unified Vision Encoding, Context-Aware Recursive Alignment Decoding, and a Dual-Supervised Semantic Mapping Loss, complemented by a Synthesized Language-Driven QA Dataset that supervises alignment. Across 21 benchmarks, the approach achieves state-of-the-art or competitive results with far fewer vision tokens (630, or 300 in constrained settings), and ablation studies show the contribution of each component and of the synthetic data. The work also provides a scalable data-generation framework and releases code, model weights, and datasets.

Abstract

We introduce FUSION, a family of multimodal large language models (MLLMs) with a full vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and it continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. https://github.com/starriver030515/FUSION

Paper Structure

This paper contains 42 sections, 23 equations, 19 figures, 29 tables.

Figures (19)

  • Figure 1: Performance comparison of FUSION with leading MLLM models across 18 benchmark dimensions. With only 630 vision tokens, our model (FUSION-X) significantly outperforms Cambrian-1 and Florence-VL, achieving overall parity with LLaVA-OneVision, while maintaining a minimal performance gap with top-tier models such as InternVL2 and Qwen2VL. Furthermore, even when the number of vision tokens is reduced to 300, our model (FUSION-L) preserves 95% of its original performance, remaining on par with Florence-VL.
  • Figure 2: Visualization of modality alignment and integration. At the pixel level, we compute attention maps between image regions and question-relevant keywords within the vision encoder. At the space level, we measure the cosine similarity between vision tokens projected into the LLM embedding space and the corresponding keywords. At the question level, we visualize attention maps from question keywords to vision tokens during LLM decoding. The results indicate that our model achieves consistent and progressively enhanced cross-modal alignment throughout the processing pipeline.
  • Figure 3: Illustration of our Text-Guided Unified Vision Encoding and Dual-Supervised Semantic Mapping Loss. Given an input image, the corresponding question is first projected into the vision feature space and processed jointly with the image. The extracted visual features are then mapped into the text space and fed into the LLM. To ensure the reliability of the mapping MLP, we reconstruct the text and image tokens by reusing the encoded tokens and projecting them back into their original feature spaces, then compute the similarity between the reconstructed and raw tokens to encourage structural alignment between the two spaces.
  • Figure 4: Illustration of our Context-Aware Recursive Alignment Decoding. For each set of question tokens (highlighted in yellow), we prepend a set of context-aware latent tokens (highlighted in green). Additional interaction layers are introduced between decoding layers, where vision tokens interact with both latent tokens and question tokens at a question-level granularity (e.g., Group 1, Group 2, …).
  • Figure 5: Overview of our Text-Centered QA Dataset framework. Our approach shifts the focus from visual content to textual richness by leveraging high-quality captions, enriching them with LLMs, and using them as the foundation for both image generation via diffusion models and diverse QA pair construction.
  • ...and 14 more figures
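
The space-level measurement described in the Figure 2 caption (cosine similarity between projected vision tokens and a keyword embedding) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `cosine_similarity` and `similarity_map`, the three-dimensional toy vectors, and the zero-vector convention are hypothetical and are not taken from the FUSION code release.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_map(
    vision_tokens: Sequence[Sequence[float]],
    keyword_embedding: Sequence[float],
) -> List[float]:
    """One similarity score per vision token (hypothetical helper)."""
    return [cosine_similarity(tok, keyword_embedding) for tok in vision_tokens]


# Toy stand-ins for vision tokens already projected into the LLM embedding
# space, and for the embedding of one question keyword.
vision_tokens = [
    [1.0, 0.0, 0.0],  # same direction as the keyword
    [1.0, 1.0, 0.0],  # 45 degrees away from the keyword
    [0.0, 0.0, 2.0],  # orthogonal to the keyword
]
keyword = [2.0, 0.0, 0.0]

scores = similarity_map(vision_tokens, keyword)
best_token = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_token)
```

Reshaping such per-token scores onto the image's patch grid would give a heatmap of the kind the caption describes; the real model's embedding dimensions and projection layers are not shown here.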