AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

TL;DR

AlignVLM addresses the cross-modal alignment bottleneck with Align, a connector that maps vision features to a convex combination of the LLM's text embeddings. This uses the LLM's linguistic priors to keep visual inputs within the region of the embedding space the model can interpret. The Align module projects visual features into a probability distribution over the vocabulary, $\mathbf{P}_{\text{vocab}}$, and forms $\mathbf{F}_{\text{align}}' = \mathbf{P}_{\text{vocab}}^T \mathbf{E}_{\text{text}}$, which is concatenated with the text embeddings before being fed to the LLM. Trained in three stages on the CC-12M, BigDocs-7.5M, and DocDownstream datasets, AlignVLM achieves state-of-the-art results on multimodal document understanding benchmarks, with particularly strong gains under low-resource training and robustness to noise. The approach is more efficient than deep fusion models and more data-efficient than prior shallow-fusion connectors, enabling effective document-focused VLMs with smaller parameter overhead. The findings suggest that constraining the embedding space with linguistic priors can substantially improve cross-modal alignment and practical deployment.
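The mapping described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single linear projection `w_vocab`, the use of a softmax to obtain $\mathbf{P}_{\text{vocab}}$, the toy dimensions, and the ordering of vision before text in the concatenation are all assumptions made for illustration. The function name `align_connector` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(vision_features, w_vocab, text_embeddings):
    """Map each visual feature to a convex combination of text embeddings.

    vision_features: T vectors of length d_v (vision encoder outputs)
    w_vocab:         V rows of length d_v (hypothetical projection to vocab logits)
    text_embeddings: V rows of length d (the LLM embedding matrix, E_text)
    """
    d = len(text_embeddings[0])
    aligned = []
    for feat in vision_features:
        # Project the visual feature to one logit per vocabulary token.
        logits = [sum(w * x for w, x in zip(row, feat)) for row in w_vocab]
        # Assumed softmax: turns logits into a distribution over the vocabulary.
        p_vocab = softmax(logits)
        # Weighted average of the text embeddings under that distribution.
        aligned.append([
            sum(p * emb[j] for p, emb in zip(p_vocab, text_embeddings))
            for j in range(d)
        ])
    return aligned


# Toy setup: vocabulary of 3 tokens, 2-dimensional embeddings.
text_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w_vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
vision_features = [[0.5, 2.0], [3.0, -1.0]]

aligned = align_connector(vision_features, w_vocab, text_embeddings)

# Text tokens are looked up in the same embedding matrix, then the two
# sequences are concatenated (vision first is an assumed ordering).
text_token_ids = [1, 2]
text_inputs = [text_embeddings[i] for i in text_token_ids]
llm_input = aligned + text_inputs
```

Because the weights are non-negative and sum to one, every aligned vector lies inside the convex hull of the rows of `text_embeddings`. In this toy case that hull is the triangle with corners (0, 0), (1, 0), and (0, 1).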

Abstract

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.

Paper Structure

This paper contains 33 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Performance of Different VLM Connectors. The proposed Align connector outperforms other methods across benchmarks using the same training configuration. Radial distance is the proportion of the maximal score, truncated at $0.7$ (black dot).
  • Figure 2: AlignVLM Model Architecture. The vision encoder extracts image features, which are processed to produce probabilities over the LLM embeddings. A weighted average combines these probabilities with the embeddings to generate vision input vectors. Text inputs are tokenized, and the corresponding embeddings are selected from the embedding matrix and then used as input to the LLM. We display the vision layers in blue and the text layers in purple.
  • Figure 3: Probability distribution over LLM tokens, highlighting dense probabilities for whitespace tokens.
  • Figure 4: PCA of Align Embeddings: The principal components of the most influential embeddings in the Align Connector span most of the feature space represented by all embeddings.
  • Figure 5: Comparison of Llama-3.2-3B-Align and Llama-3.2-3B-MLP on the Easy and Hard VCR tasks.
  • ...and 4 more figures