Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez
TL;DR
This study systematically compares text-based image preprocessing with direct multimodal embedding retrieval in multimodal Retrieval-Augmented Generation for financial documents. Using a newly constructed 40-question benchmark covering text and charts, the authors evaluate six LLMs and two multimodal embedding models, finding that native image embeddings ($IMG$) substantially outperform the text-only approach built on LLM image summaries ($LLM_IMG$) in both retrieval metrics ($mAP@5$, $nDCG@5$) and end-to-end answer quality, with relative gains of up to ~32% in $mAP@5$ and ~20% in $nDCG@5$. The results show improved factual accuracy and fewer hallucinations when images are preserved in their native form, especially for larger models with multimodal reasoning. The work highlights practical implications for production RAG systems, suggesting a shift toward multimodal embeddings to maximize retrieval precision and answer reliability in visually rich documents such as financial reports and presentations. It also discusses limitations related to preprocessing and proposes future work on other domains and on automating visual-content segmentation.
Abstract
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing and store only the text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches across 6 LLMs and 2 multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, underscoring their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers, as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
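For readers unfamiliar with the two retrieval metrics named in the abstract, the following is a minimal sketch of how mAP@5 and nDCG@5 are commonly computed with binary relevance. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the document identifiers, and the choice to normalize average precision by min(number of relevant documents, k) are all hypothetical, and the paper may use a different variant.

```python
import math


def average_precision_at_k(ranked_ids, relevant_ids, k=5):
    """AP@k with binary relevance (assumes ranked_ids has no duplicates).

    Assumption: normalized by min(|relevant|, k); other variants exist.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(runs, k=5):
    """mAP@k: mean of AP@k over (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary gains and a log2 rank discount."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical question with 2 relevant documents (1 image, 1 text chunk),
# mirroring the benchmark layout described in the abstract.
ranked = ["img_7", "txt_2", "txt_9", "img_1", "txt_4"]
relevant = {"img_7", "txt_4"}
ap = average_precision_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```

In this made-up example the relevant image is ranked first and the relevant text chunk fifth, so AP@5 is (1/1 + 2/5) / 2 = 0.7. Both metrics reward placing relevant documents earlier, which is why a retriever that surfaces the correct chart near the top scores higher on each.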
