Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching

Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose

TL;DR

This paper proposes a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching (ITM), which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling.

Abstract

Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representations to estimate their similarity accurately. Most existing methods focus on feature enhancement within a modality or feature interaction across modalities, but neglect the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects' spatial positions and their scene graph. We apply implicit relationship modelling for potential relationship interactions before explicit modelling, which improves the fault tolerance of explicit relationship detection. The visual and textual semantic representations are then refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modelling is beneficial for image-text matching. The proposed Hire obtains new state-of-the-art results on the MS-COCO and Flickr30K benchmarks.
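To make the idea of inter-modal interactive attention followed by similarity estimation more concrete, the following is a minimal sketch in plain Python. It is not the authors' formulation: the function names (`attend`, `image_text_similarity`), the `temperature` parameter, the use of cosine similarity, and the toy 3-dimensional features are all assumptions chosen for illustration. Each region attends over the words, and the image-text score is the average similarity between a region and its attended text context.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, temperature=4.0):
    # Weighted sum of `keys`; weights are a softmax over the scaled
    # cosine similarity between `query` and each key.
    # `temperature` is a hypothetical sharpening factor.
    weights = softmax([temperature * cosine(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]

def image_text_similarity(regions, words, temperature=4.0):
    # Average similarity between each region and its attended text context.
    sims = [cosine(r, attend(r, words, temperature)) for r in regions]
    return sum(sims) / len(sims)

# Toy features (assumed, not taken from the paper).
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matching_words = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
unrelated_words = [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]

score_match = image_text_similarity(regions, matching_words)
score_mismatch = image_text_similarity(regions, unrelated_words)
print(score_match > score_mismatch)
```

In this toy setting the sentence whose words align with the regions receives the higher score, which is the behaviour a ranking-based matching objective would rely on.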

Paper Structure (23 sections, 13 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of the explicit and implicit intra-modal modelling schemas for the semantic relationship. ① The explicit spatial-semantic relationship modelling schema: objects, along with their spatial and semantic relationships, are jointly modelled based on the relative positions and the detected scene graphs. However, the subject-relation-object pairs (③) in the detected scene graph of each image usually contain errors or do not match the text. For example, in window-on-train, the word labels of the relation "on" and the object "train" can hardly represent the corresponding semantic content accurately, or are even wrong (in red). To this end, the relational connectivity (whether a relationship exists or not), rather than the object/attribute label, is encoded into the object features. In addition, some relation pairs are missing altogether because of the limited label range of the offline detector, e.g., truck-with-window. Fortunately, this can be alleviated by the implicit relationship modelling (②), which constructs general relationships among object regions. ② The implicit relationship modelling schema: object relationships are constructed by fully connecting the object regions, so that information can be propagated and aggregated among objects according to their potential relationships. However, it is hard to maintain strong inter-object relationships in a multi-layer network. To deal with the above issues, it is intuitive to combine implicit and explicit relationship modelling so that the visual semantic representation incorporates the inter-object relationships.
  • Figure 2: The overall framework (image-to-text version) of Hire. In intra-modal semantic correlation (① and ②), implicit relationship reasoning is first used to obtain the potential semantic connections among all candidate regions, and similarly among the high-level textual word embeddings from pre-trained BERT. Then, relationship-aware GCNs (R-GCNs) are constructed to integrate the explicit spatial and semantic relationships between every two objects into their region representations by changing the relationship-determined graph adjacency matrix. In inter-modal semantic correlation (③ and ④), the visual and textual semantic features are further enhanced via object-word interactive attention, and the visual semantic representation is refined via the cross-level object-sentence and word-image-based interactive attention. Visual and textual semantic similarity is finally estimated for the cross-modal alignment.
  • Figure 3: Visualization of main modules: (i) the refined relationships between the target object (in green box) and other correlated object regions after implicit visual relationship reasoning (VSA) and explicit visual spatial-semantic graph reasoning (VSSG), (ii) results on top-4 region-words pair correspondences of each target object (in green box) for image-to-text, (iii) results on top-5 word-regions pair correspondences of each target word for text-to-image. The degree of white coverage of regions and the thickness of lines indicate different learning weights (best viewed in color).
  • Figure 4: Comparisons of image-to-text matching between the proposed Hire and DIME [29] on MS-COCO (at the top) and Flickr30K (at the bottom). For each image query, we present the top-5 retrieved sentences, where the mismatches are highlighted in red.
  • Figure 5: Comparisons of text-to-image matching between our Hire and DIME [29] on MS-COCO and Flickr30K. For each text query, we present the top 3 ranked images, ranking from left to right. The correctly matched images are marked in green and the mismatched images are marked in red (best viewed in color).
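
The captions of Figures 1 and 2 describe encoding relational connectivity (whether a relationship exists) rather than the relation label, and injecting it through a relationship-determined adjacency matrix. The sketch below illustrates that idea in plain Python under strong simplifying assumptions: the helper names (`connectivity_matrix`, `propagate`) are hypothetical, edges are treated as undirected with self-loops, and the aggregation is a weight-free neighbourhood average rather than a learned graph convolution as in the paper's R-GCNs.

```python
def connectivity_matrix(num_objects, triples):
    # Binary adjacency: 1.0 where a (subject, relation, object) triple
    # links two objects. The relation label is deliberately ignored;
    # only the existence of a relationship is kept.
    adj = [[0.0] * num_objects for _ in range(num_objects)]
    for i in range(num_objects):
        adj[i][i] = 1.0  # self-loop so each object keeps its own feature
    for subj, _relation, obj in triples:
        adj[subj][obj] = 1.0
        adj[obj][subj] = 1.0  # assumption: edges are undirected
    return adj

def propagate(features, adj):
    # One aggregation step: each object becomes the mean of itself and
    # its connected neighbours (row-normalised adjacency, no learned weights).
    n, dim = len(features), len(features[0])
    out = []
    for i in range(n):
        degree = sum(adj[i])  # never zero because of the self-loop
        out.append([
            sum(adj[i][j] / degree * features[j][d] for j in range(n))
            for d in range(dim)
        ])
    return out

# Toy scene: 0 = window, 1 = train, 2 = truck (features are made up).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
triples = [(0, "on", 1)]  # the detector found window-on-train only

adj = connectivity_matrix(len(features), triples)
updated = propagate(features, adj)
print(updated)
```

In this toy scene the truck has no detected relation, so its feature is left unchanged by the explicit graph step. This mirrors the missing-pair problem noted in the Figure 1 caption, which the fully connected implicit modelling is meant to alleviate.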