Multimodal Neural Databases

Giovanni Trappolini, Andrea Santilli, Emanuele Rodolà, Alon Halevy, Fabrizio Silvestri

TL;DR

Problem: querying unstructured multimodal data with database-like capabilities is challenging because it requires cross-modal reasoning at scale. Approach: the authors propose Multimodal Neural Databases (MMNDBs) and a three-component architecture (retriever, reasoner, aggregator), validated with a first prototype on image collections using natural-language queries (COUNT, MAX, IN) and CLIP/OFA models. Contributions: a principled design, a spectrum of retrieval strategies, and an experimental analysis highlighting robustness to noise and the damaging impact of non-relevant documents, together with a discussion of limitations and future directions. Significance: MMNDBs aim to bridge MMIR and traditional databases, enabling scalable, expressive querying over multimodal archives and guiding future research.

Abstract

The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval of extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support and, in particular, cannot answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and test it with several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area. Code to replicate the experiments will be released at https://github.com/GiovanniTRA/MultimodalNeuralDatabases

Paper Structure (13 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: A possible use case for MMNDBs. Imagine walking around the city with smart glasses and collecting information in a multimodal database. In the evening, you could be interested in knowing which are good places to eat that satisfy some criteria. MMNDBs could help make that decision by answering a database-like query posed in natural language (or voice!), combining multiple information sources and modalities.
  • Figure 2: Schema for our proposed MMNDB prototype. Given a query, documents are first filtered by a retriever module. A reasoner produces intermediate answers that are then processed by an aggregator to produce the final answer.
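
The retriever, reasoner, aggregator flow described in the Figure 2 caption can be illustrated with a small toy sketch. This is not the authors' implementation. The `Document` class, the function names, and the exact semantics given here to COUNT, MAX, and IN are assumptions made for illustration. The dictionary lookups stand in for the neural models (CLIP/OFA) that the prototype reportedly uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Document:
    """Hypothetical stand-in for an image: maps object name to instance count."""
    doc_id: int
    objects: Dict[str, int] = field(default_factory=dict)


def retrieve(docs: List[Document], target: str) -> List[Document]:
    # Retriever: filter the collection down to documents that may be relevant.
    # A real system would score documents with a model; this toy checks a key.
    return [d for d in docs if target in d.objects]


def reason(docs: List[Document], target: str) -> List[int]:
    # Reasoner: one intermediate answer per retrieved document
    # (assumed here to be the number of target instances in that document).
    return [d.objects.get(target, 0) for d in docs]


def aggregate(partials: List[int], op: str) -> Union[int, bool]:
    # Aggregator: combine intermediate answers into the final answer.
    # The operator semantics below are assumptions for this sketch.
    if op == "COUNT":  # number of documents containing the target
        return sum(1 for p in partials if p > 0)
    if op == "MAX":    # largest per-document instance count
        return max(partials, default=0)
    if op == "IN":     # whether the target appears anywhere
        return any(p > 0 for p in partials)
    raise ValueError(f"unsupported operator: {op}")


def answer(docs: List[Document], target: str, op: str) -> Union[int, bool]:
    return aggregate(reason(retrieve(docs, target), target), op)


docs = [
    Document(0, {"dog": 2, "car": 1}),
    Document(1, {"car": 3}),
    Document(2, {"dog": 1}),
]
print(answer(docs, "dog", "COUNT"))  # 2
print(answer(docs, "car", "MAX"))    # 3
print(answer(docs, "cat", "IN"))     # False
```

Keeping the three stages as separate functions mirrors the caption: the retriever limits how many documents the (expensive) reasoner must process, and the aggregator only ever sees the intermediate answers.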