WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

Prasanna Sridhar; Horace Lee; David M. S. Pinto; Andrew Zisserman; Abhishek Dutta

TL;DR

WISE is an open-source audiovisual search engine that integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise.

Abstract

In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at https://gitlab.com/vgg/wise/wise.
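To illustrate how a vector-search query can be combined with a metadata filter, as in the "train" plus "Germany" example, the following Python sketch restricts a toy index to items whose metadata match and then ranks them by cosine similarity. Everything in it is an assumption made for illustration: the `search` function, the record layout, and the hand-written two-dimensional vectors are hypothetical and are not taken from the WISE codebase. The exhaustive scan stands in for whatever nearest-neighbour index a real deployment would use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query_vec, metadata_filter=None, top_k=3):
    """Return the ids of the top_k items most similar to query_vec.

    index: list of dicts with keys 'id', 'vec' and 'meta' (hypothetical layout).
    metadata_filter: optional dict; an item is kept only if every key matches.
    """
    candidates = [
        item for item in index
        if metadata_filter is None
        or all(item["meta"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine(query_vec, item["vec"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Toy data: made-up 2-D vectors standing in for real image embeddings.
index = [
    {"id": "img_001", "vec": [0.9, 0.1], "meta": {"country": "Germany"}},
    {"id": "img_002", "vec": [0.8, 0.2], "meta": {"country": "France"}},
    {"id": "img_003", "vec": [0.1, 0.9], "meta": {"country": "Germany"}},
]

# Stand-in for the embedding of the text query "train".
query_vec = [1.0, 0.0]

german_trains = search(index, query_vec, {"country": "Germany"}, top_k=2)
unfiltered = search(index, query_vec, top_k=2)
```

In this sketch the filter is applied before ranking, so `img_002` is never scored for the filtered query even though its vector is close to the query vector. Filtering after ranking would be an equally reasonable design under other assumptions.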

Paper Structure (6 sections, 4 figures)

This paper contains 6 sections, 4 figures.

Introduction
Multimodal Search Capabilities
Audiovisual Data Processing and Search
Software Architecture
Case Studies
Discussion and Conclusion

Figures (4)

Figure 1: WISE supports multimodal search which allows (a) filtering visual search results using metadata, or (b) performing composed queries by combining an image with a refining text description. Similarly, a compositional query can be applied to face search results (c), to retrieve (d) an actor in a particular type of scene, for example.
Figure 2: WISE organises the media library into two streams: visual and audio. Scene and region level features (or embeddings) are extracted from the visual stream, which includes images and video frames, while acoustic event and speech features are extracted from the audio stream. All features are saved in a vector store to enable fast nearest neighbour search. Feature extraction and indexing are performed once as an offline process. At query time, audiovisual features are extracted from the search query and matched against the index to retrieve semantically similar content from the media library.
Figure 3: WISE can operate in aggregator mode, where a large audiovisual collection is split up across multiple standalone nodes each covering a different subset. A central instance distributes search queries to all standalone nodes, gathers their results, and presents a unified response to the user.
Figure 4: WISE has proved invaluable for journalists investigating stories based on audiovisual content (left) and for film and media scholars studying historical archives (right).
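The aggregator mode described in the Figure 3 caption follows a scatter-gather pattern, which the Python sketch below illustrates. It is not the WISE implementation: `make_node` and `aggregate_search` are hypothetical names, each node is a plain function standing in for a network call to a standalone instance, and the sketch assumes that similarity scores returned by different nodes are directly comparable.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def make_node(results_by_query):
    """Build a stand-in for a standalone node (hypothetical helper).

    The returned callable maps a query string to a list of
    (score, item_id) pairs, as a remote search call might.
    """
    def node(query):
        return results_by_query.get(query, [])
    return node


def aggregate_search(nodes, query, top_k=3):
    """Send the query to every node, gather the hits, return the best top_k."""
    # Scatter: query all nodes concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node_hits = list(pool.map(lambda node: node(query), nodes))
    # Gather: flatten the per-node lists into one list of hits.
    merged = [hit for hits in per_node_hits for hit in hits]
    # Unify: keep the highest-scoring hits across all nodes.
    return heapq.nlargest(top_k, merged)


# Two toy nodes, each covering a different subset of the collection.
node_a = make_node({"train": [(0.91, "a/clip_07"), (0.40, "a/clip_02")]})
node_b = make_node({"train": [(0.88, "b/img_113"), (0.75, "b/img_020")]})

unified = aggregate_search([node_a, node_b], "train", top_k=3)
```

Here `unified` interleaves results from both nodes by score, so the user sees a single ranked list regardless of which node holds each item.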
