Multi-Modal Data Exploration via Language Agents
Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger
TL;DR
The paper tackles querying across heterogeneous data sources (tables, text, images) by introducing M$^2$EX, an LLM-driven framework that decomposes complex natural language questions into a directed acyclic graph (DAG) of parallelizable subtasks and orchestrates modality-specific tools. It emphasizes self-debugging, selective re-planning, and explainable reasoning to optimize task execution and reduce latency and API costs. Empirical results on ArtWork, RotoWire, and EHRXQA show that M$^2$EX surpasses state-of-the-art baselines (CAESURA and NeuralSQL) in accuracy while delivering robust planning and transparency across multi-modal queries. The approach demonstrates the practical potential of agentic LLMs for scalable, user-centric multi-modal data exploration. Future directions include improved image reasoning, better data alignment, support for additional modalities such as video, and human-in-the-loop interaction.
Abstract
International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., text, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX, a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.
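To make the idea of a query plan over modality-specific experts more concrete, the following is a minimal sketch of executing a DAG of subtasks, with independent subtasks running in parallel. It is not the paper's implementation: the tool functions, the in-memory rows, the image labels, and the plan layout are all hypothetical stand-ins, and the LLM planner, self-debugging, and re-planning steps are omitted. The scheduler runs tasks in waves, which is a simplification of what a real orchestrator might do.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Made-up data standing in for a relational table and image labels.
ROWS = [
    {"image": "a.jpg", "century": 17},
    {"image": "b.jpg", "century": 18},
    {"image": "c.jpg", "century": 18},
]
WAR_IMAGES = {"a.jpg", "c.jpg"}

# Hypothetical tools. A real system would call a text-to-SQL model,
# a database, and an image-analysis model instead of these stubs.
def sql_paintings(inputs):
    return list(ROWS)

def sql_total(inputs):
    return len(ROWS)

def image_check(inputs):
    return [r for r in inputs["sql_paintings"] if r["image"] in WAR_IMAGES]

def combine(inputs):
    return {"matching": len(inputs["image_check"]), "total": inputs["sql_total"]}

# Plan: task name -> (tool, names of tasks whose outputs it needs).
PLAN = {
    "sql_paintings": (sql_paintings, []),
    "sql_total": (sql_total, []),
    "image_check": (image_check, ["sql_paintings"]),
    "combine": (combine, ["image_check", "sql_total"]),
}

def run_plan(plan, max_workers=4):
    """Run every task once its dependencies are done; ready tasks run concurrently."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in plan.items()})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            futures = {}
            # Submit all tasks whose dependencies are already satisfied.
            for name in sorter.get_ready():
                tool, deps = plan[name]
                futures[name] = pool.submit(tool, {d: results[d] for d in deps})
            # Wait for this wave, then unlock the tasks that depend on it.
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results

if __name__ == "__main__":
    print(run_plan(PLAN)["combine"])
```

In this sketch the two SQL-like tasks have no dependencies, so they are submitted together, while the image check waits for the painting rows and the final step waits for both branches.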
