Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

ReT tackles multimodal retrieval with image+text queries by fusing multi-layer visual and textual features through a Transformer-based recurrent cell that employs forget and input gates to produce a set of tokens used for fine-grained similarity scoring via $s(\mathbf{Q},\mathbf{D}) = \sum_{i=1}^{k} \max_{j=1}^{k} \mathbf{Q}_i \cdot \mathbf{D}_j$. Trained with a symmetric InfoNCE objective on the M2KR and M-BEIR benchmarks, ReT achieves state-of-the-art results and demonstrates strong performance in retrieval-augmented VQA setups, validating the benefits of multi-layer, cross-modal fusion for robust multimodal retrieval. The work highlights the importance of leveraging both shallow and deep representations and provides a scalable framework that improves cross-modal matching across diverse datasets and tasks. The public release of code and models further enables adoption in downstream multimodal search and VQA pipelines.
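The scoring rule and training objective named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released ReT code: it assumes the query and document are each already encoded as a `(k, d)` array of tokens, and the function names and the `temperature` parameter are hypothetical choices that the summary does not specify.

```python
import numpy as np

def late_interaction_score(Q, D):
    """s(Q, D) = sum_i max_j Q_i . D_j for token arrays Q, D of shape (k, d)."""
    sim = Q @ D.T                         # (k, k) pairwise dot products
    return float(sim.max(axis=1).sum())   # best document token per query token, summed

def symmetric_info_nce(scores, temperature=1.0):
    """Symmetric InfoNCE over a batch score matrix.

    scores[a, b] holds s(Q_a, D_b); matched pairs sit on the diagonal.
    `temperature` is an assumed hyperparameter, not taken from the source.
    """
    logits = scores / temperature

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilise the softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the query-to-document and document-to-query directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Taking the maximum over document tokens lets each query token match its closest counterpart independently, which is what makes the score fine-grained compared with a single pooled dot product.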

Abstract

Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
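To make the idea of LSTM-inspired sigmoidal gates concrete, the following is a generic gated-update sketch under stated assumptions. It is not the ReT cell: in the paper the per-layer candidate comes from a Transformer-based cell that merges textual and visual features, whereas here `candidate` is simply a precomputed array. The gate parameterisation (a linear map over the concatenated state and candidate) and the names `gated_update`, `fuse_layers`, `W_f`, and `W_i` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(h_prev, candidate, W_f, W_i):
    """One LSTM-style step on a (k, d) hidden state.

    h_prev, candidate: arrays of shape (k, d)
    W_f, W_i: assumed gate weights of shape (2d, d)
    """
    z = np.concatenate([h_prev, candidate], axis=-1)  # (k, 2d)
    f = sigmoid(z @ W_f)   # forget gate: how much of the previous state to keep
    i = sigmoid(z @ W_i)   # input gate: how much of the new layer to admit
    return f * h_prev + i * candidate

def fuse_layers(h0, layer_candidates, W_f, W_i):
    """Run the gated update over features from successive backbone layers."""
    h = h0
    for candidate in layer_candidates:
        h = gated_update(h, candidate, W_f, W_i)
    return h
```

With all gate weights at zero both gates evaluate to 0.5, so each step averages the previous state with the incoming layer; learned weights would instead let the model favour shallow or deep layers per token.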

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison between cross-modal retrieval with unimodal queries (top left), retrieval with multimodal queries with feature fusion as in UniIR (Wei et al., 2024; top right), and our ReT (bottom left). Our approach enables retrieval with multimodal queries employing a recurrent, multi-level feature extraction process.
  • Figure 2: Overview of the proposed Recurrence-enhanced Transformer (ReT) for cross-modal retrieval with multimodal queries. Our model employs a Transformer-based recurrent cell to encode multiple vision-and-language layers into hidden vectors for similarity computation.
  • Figure 3: Graphical overview of the designed recurrent cell for the proposed retrieval model, which integrates layer-specific textual and visual features into a matricial hidden state.
  • Figure 4: Qualitative analysis of gate activations. The left side displays gate behavior during the encoding of a single query-document pair, while the right side shows average activations over 2k examples from the M2KR InfoSeek test split.
  • Figure 5: Qualitative results on M2KR, for datasets that do not include document images.
  • ...and 3 more figures