Multi-Modal Data Exploration via Language Agents

Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger

TL;DR

The paper tackles querying across heterogeneous data sources (tables, text, images) by introducing M$^2$EX, an LLM-driven framework that decomposes complex natural language questions into a Directed Acyclic Graph (DAG) of parallelizable subtasks and orchestrates modality-specific tools. It emphasizes self-debugging, selective re-planning, and explainable reasoning to optimize task execution and reduce latency and API costs. Empirical results on ArtWork, RotoWire, and EHRXQA show that M$^2$EX surpasses state-of-the-art baselines (CAESURA and NeuralSQL) in accuracy while delivering robust planning and transparency across multi-modal queries. The approach demonstrates the practical potential of agentic LLMs for scalable, user-centric multi-modal data exploration, with future directions including improved image reasoning, data alignment, and support for additional modalities such as video and human-in-the-loop interaction.

Abstract

International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX, a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks, such as text-to-SQL generation and image analysis, and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.
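
The decomposition of a question into a DAG of subtasks that are dispatched to modality-specific experts can be illustrated with a small sketch. This is not the authors' implementation: the task names, the three tool functions, and the hard-coded data are hypothetical stand-ins, and the plan is written by hand rather than produced by an LLM. It only shows how tasks whose dependencies are satisfied can run in parallel, assuming Python 3.9+ for `graphlib`.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical stand-ins for modality-specific experts (not the paper's API).
def fetch_centuries(inputs):
    # Stands in for a text-to-SQL step: image path -> century of the painting.
    return {"a.jpg": 16, "b.jpg": 17}

def count_swords(inputs):
    # Stands in for an image-analysis step: image path -> number of swords.
    return {"a.jpg": 2, "b.jpg": 0}

def pick_century(inputs):
    # Combines both upstream results: century of the image with most swords.
    swords, centuries = inputs["vision"], inputs["sql"]
    best_image = max(swords, key=swords.get)
    return centuries[best_image]

# Hand-written plan: task id -> (tool, list of task ids it depends on).
# "sql" and "vision" share no dependency, so they can run concurrently.
PLAN = {
    "sql": (fetch_centuries, []),
    "vision": (count_swords, []),
    "answer": (pick_century, ["sql", "vision"]),
}

def run_plan(plan):
    """Execute the plan in dependency order, one parallel wave at a time."""
    sorter = TopologicalSorter({task: deps for task, (_, deps) in plan.items()})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # tasks whose dependencies are done
            futures = {
                task: pool.submit(
                    plan[task][0],
                    {dep: results[dep] for dep in plan[task][1]},
                )
                for task in ready
            }
            for task, future in futures.items():
                results[task] = future.result()
                sorter.done(task)
    return results

if __name__ == "__main__":
    print(run_plan(PLAN)["answer"])
```

Because each intermediate result is stored under its task id, a user could inspect `results["sql"]` or `results["vision"]` separately, which loosely mirrors the explainability goal stated in the abstract.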

Paper Structure

This paper contains 23 sections, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: (Left): Example workflows of multi-modal data exploration in natural language over heterogeneous data sources. (Right): M$^2$EX system architecture.
  • Figure 3: M$^2$EX framework on ArtWork [urbanB24] with an example of processing a multi-modal query. The query is automatically decomposed into components such as text-to-SQL and image analysis, which the user can inspect for explainability.
  • Figure 4: Optimization of M$^2$EX: Smart replanning.
  • Figure 5: Optimization of M$^2$EX: Parallel planning.
  • Figure 6: Error analysis on different datasets: (a) CAESURA on ArtWork, (b) M$^2$EX on ArtWork, (c) CAESURA on RotoWire, (d) M$^2$EX on RotoWire, and (e) M$^2$EX on EHRXQA.