ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte Suresh, Srivatsava Daruru, Enamul Hoque, Spandana Gella, Torsten Scholak, Sai Rajeswar
TL;DR
ColMate targets a gap in multimodal document retrieval, where existing methods largely reuse techniques designed for text-only retrieval. It introduces three components: Masked OCR Language Modeling (MOLM), a pretraining objective that enriches visual-token representations; Self-supervised Masked Contrastive Learning (MaskedCL), which aligns modalities without labeled pairs; and TopKSim, a late-interaction scoring scheme suited to visual documents. Combining pretraining, self-supervised finetuning, and refined scoring, ColMate achieves state-of-the-art results on ViDoRe V1/V2, improving out-of-domain generalization and reducing noise from patch-based tokenization. The results indicate that modality-specific pretraining objectives and top-K similarity aggregation yield measurable gains, and that MaskedCL is particularly promising in low-resource scenarios. Overall, ColMate offers strong generalization while remaining efficient enough for real-world deployment.
Abstract
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate uses a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism better suited to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
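To make the late interaction scoring idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: this summary does not define TopKSim, so the sketch assumes it replaces the usual per-query-token maximum (MaxSim-style late interaction) with the mean of the k highest query-to-patch similarities. The function name `late_interaction_score`, the parameter `k`, and the use of raw dot products as similarities are all hypothetical choices made for the example.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(
    query_tokens: Sequence[Vector],
    doc_patches: Sequence[Vector],
    k: int = 1,
) -> float:
    """Score a document against a query with late interaction.

    For each query token, compute its similarity to every document
    patch, keep the k highest values, and average them. The final
    score is the sum of these per-token averages.

    With k=1 this reduces to max-based late interaction. Values of
    k > 1 are an ASSUMED reading of "top-K similarity aggregation";
    the exact TopKSim formulation is not given in this summary.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not doc_patches:
        raise ValueError("doc_patches must not be empty")

    total = 0.0
    for q in query_tokens:
        # Similarities of this query token to all patches, best first.
        sims = sorted((dot(q, d) for d in doc_patches), reverse=True)
        top = sims[:k]  # fewer than k values if the document is short
        total += sum(top) / len(top)
    return total
```

With `k=1` a single noisy patch can dominate a query token's contribution; averaging over several top matches is one plausible way to dampen that effect, which is consistent with the noise-reduction motivation stated in the TL;DR.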
