AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
TL;DR
AlignVLM tackles the bottleneck of cross-modal alignment by introducing Align, a connector that maps vision features into a convex combination of the LLM's text embeddings, using the LLM's linguistic priors to keep visual signals within the region of the embedding space the model can interpret. The Align module projects visual features into a probability distribution over the vocabulary, $\mathbf{P}_{\text{vocab}}$, and forms $\mathbf{F}_{\text{align}}' = \mathbf{P}_{\text{vocab}}^T \mathbf{E}_{\text{text}}$, which is concatenated with the text embeddings before being fed to the LLM. Trained in three stages on the CC-12M, BigDocs-7.5M, and DocDownstream datasets, AlignVLM achieves state-of-the-art results on multimodal document understanding benchmarks, with particularly strong gains under low-resource training and robustness to noise. The approach is more efficient than deep-fusion models and more data-efficient than prior shallow-fusion connectors, enabling effective document-focused VLMs with smaller parameter overhead. The findings suggest that embedding-space constraints guided by linguistic priors can substantially improve cross-modal alignment and practical deployment.
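The mapping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single linear projection `w_vocab`, the toy dimensions, the random weights, and the token ids used for the prompt are all assumptions made for illustration, and the function name `align_connector` is hypothetical. With rows as patches, the product `p_vocab @ text_embeddings` plays the role of $\mathbf{P}_{\text{vocab}}^T \mathbf{E}_{\text{text}}$ under this shape convention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def align_connector(vision_feats, w_vocab, text_embeddings):
    """Map vision features to convex combinations of text embeddings.

    vision_feats:    (n_patches, d_vision) output of a vision encoder
    w_vocab:         (d_vision, vocab_size) hypothetical projection to vocab logits
    text_embeddings: (vocab_size, d_model) the LLM's token embedding matrix
    """
    logits = vision_feats @ w_vocab            # (n_patches, vocab_size)
    p_vocab = softmax(logits, axis=-1)         # each row is a distribution over tokens
    aligned = p_vocab @ text_embeddings        # (n_patches, d_model)
    return p_vocab, aligned


# Toy sizes and random values, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
n_patches, d_vision, vocab_size, d_model = 4, 8, 16, 6
vision_feats = rng.normal(size=(n_patches, d_vision))
w_vocab = rng.normal(size=(d_vision, vocab_size))
text_embeddings = rng.normal(size=(vocab_size, d_model))

p_vocab, aligned = align_connector(vision_feats, w_vocab, text_embeddings)

# Embeddings of a hypothetical three-token text prompt.
prompt_embeds = text_embeddings[[1, 5, 2]]

# Aligned visual tokens are concatenated with text embeddings as LLM input.
llm_inputs = np.concatenate([aligned, prompt_embeds], axis=0)
```

Because every row of `p_vocab` is non-negative and sums to one, each aligned visual token lies inside the convex hull of the text embeddings, which is the constraint the connector relies on.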
Abstract
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
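One way to see why a weighted average keeps visual features in a region the LLM can interpret is the short bound below. The symbols are assumptions introduced for illustration: $V$ is the vocabulary size, $\mathbf{e}_i$ is the $i$-th text embedding, and $p_i$ are the averaging weights, assumed non-negative and summing to one (as they would be if produced by a softmax).

$$
\mathbf{f}' = \sum_{i=1}^{V} p_i \, \mathbf{e}_i, \qquad p_i \ge 0, \qquad \sum_{i=1}^{V} p_i = 1
\quad\Longrightarrow\quad
\|\mathbf{f}'\| \;\le\; \sum_{i=1}^{V} p_i \, \|\mathbf{e}_i\| \;\le\; \max_{i} \|\mathbf{e}_i\|
$$

The first inequality is the triangle inequality, and the second holds because the weights form a convex combination. Under these assumptions, a mapped visual feature can never have a larger norm than the largest text embedding, whereas an unconstrained MLP output has no such bound.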
