Composing Object Relations and Attributes for Image-Text Matching

Khoi Pham, Chuong Huynh, Ser-Nam Lim, Abhinav Shrivastava

TL;DR

CORA tackles image-text retrieval by replacing heavy cross-attention with a dual-encoder architecture that represents captions as scene graphs. A two-stage graph attention encoding captures object attributes and inter-object relations, while a contrastive loss and a specificity loss align image, caption, and object entities in a shared embedding space. The approach yields state-of-the-art or competitive results on Flickr30K and MS-COCO with significantly faster inference than cross-attention methods, demonstrating the practical value of structured relational representations. This work underscores the effectiveness of scene-graph-based text encoding for robust, scalable image-text matching in real-world retrieval systems.
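To make the contrastive alignment concrete, here is a minimal sketch of a hinge-based triplet loss with hardest-negative mining, a common choice in the visual semantic embedding literature. This summary does not give CORA's exact loss formulation, so the function name, the `margin` value, and the hardest-negative strategy are all assumptions for illustration. The specificity loss is not sketched because it is not defined here.

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """Hinge triplet loss over a batch (illustrative, not CORA's exact loss).

    sim[i, j] is the similarity between image i and caption j;
    matching image-caption pairs lie on the diagonal.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                 # similarity of each matching pair
    mask = np.eye(n, dtype=bool)

    # Hinge cost of each non-matching caption for every image (row-wise),
    # and of each non-matching image for every caption (column-wise).
    cost_caption = np.clip(margin + sim - pos[:, None], 0.0, None)
    cost_image = np.clip(margin + sim - pos[None, :], 0.0, None)
    cost_caption[mask] = 0.0           # ignore the positive pairs themselves
    cost_image[mask] = 0.0

    # Keep only the hardest negative per image and per caption.
    return float(cost_caption.max(axis=1).sum() + cost_image.max(axis=0).sum())

# Perfectly separated batch: positives score 1, negatives score 0.
loss_separated = hardest_negative_triplet_loss(np.eye(3))
# Uninformative batch: every pair scores the same.
loss_flat = hardest_negative_triplet_loss(np.zeros((2, 2)))
```

The loss is zero once every positive pair beats its hardest negative by at least the margin, and it grows as negatives approach or overtake the positives.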

Abstract

We study the visual semantic embedding problem for image-text matching. Most existing work uses a tailored cross-attention mechanism to perform local alignment between the image and text modalities. This is more powerful than the unimodal dual-encoder approach, but it is computationally expensive. This work introduces a dual-encoder image-text matching model that represents each caption as a scene graph, with nodes for objects and attributes interconnected by relational edges. Using a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast system. Representing a caption as a scene graph lets the model exploit the strong relational inductive bias of graph neural networks to learn these relations effectively. To train the model, we propose losses that align the image and caption at both the holistic level (image-caption) and the local level (image-object entity), which we show is key to the success of the model. Our model is termed Composition model for Object Relations and Attributes (CORA). Experimental results on two prominent image-text retrieval benchmarks, Flickr30K and MSCOCO, demonstrate that CORA outperforms existing state-of-the-art, computationally expensive cross-attention methods in recall score while retaining the fast computation speed of a dual encoder.
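The speed advantage of a dual encoder comes from embedding images and captions independently, so retrieval reduces to one matrix product over precomputed embeddings. The sketch below illustrates this and the Recall@K metric on synthetic data; the random embeddings stand in for encoder outputs, and the helper names (`l2_normalize`, `recall_at_k`) are hypothetical, not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth item is among the top-k results.

    sim has shape (num_queries, num_items); gt_index[i] is the index of the
    item that matches query i.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # k best items per query
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic stand-ins for encoder outputs (assumption: 100 pairs, 32 dims).
rng = np.random.default_rng(0)
num_pairs, dim = 100, 32
image_emb = l2_normalize(rng.normal(size=(num_pairs, dim)))
# Each toy caption embedding is a noisy copy of its paired image embedding.
caption_emb = l2_normalize(image_emb + 0.1 * rng.normal(size=(num_pairs, dim)))

# Text-to-image retrieval: one matrix product scores every caption-image pair.
sim = caption_emb @ image_emb.T
gt = np.arange(num_pairs)
r1 = recall_at_k(sim, gt, 1)
r5 = recall_at_k(sim, gt, 5)
```

Because the image embeddings can be computed once and cached, adding a new query costs a single encoder pass plus a matrix-vector product, whereas a cross-attention model must run jointly on the query and every candidate.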

Paper Structure (23 sections, 12 equations, 7 figures, 9 tables)

This paper contains 23 sections, 12 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Illustration of CORA. CORA has a dual-encoder architecture, consisting of one encoder that embeds the input image and one encoder that embeds the text caption scene graph into a joint embedding space. (Best viewed in color and zoomed in.)
  • Figure 2: Overview of CORA. a) CORA consists of (1) an image encoder that detects salient regions in the input image and extracts their features, contextualizes them through multi-head self-attention, then aggregates them into a single image embedding with the GPO pooling operator (Chen et al., 2021), and (2) a text encoder that first parses the input text into a scene graph in which the semantic information is explicitly organized, then uses two graph attention networks, Object-Attribute GAT and Object-Object GAT, to encode this graph into the same joint space as the image. A red arrow denotes an edge for the active role in a relation, while a yellow arrow denotes the passive role (see the section labeled sec:scene_gat). b) The semantic concept encoder, which uses a GRU or BERT to encode each semantic concept in the graph, i.e., the object nodes, attribute nodes, and relation edges.
  • Figure 3: Qualitative results demonstrating how CORA performs image-to-text and image-to-entity retrieval. Green denotes correct retrievals and red denotes incorrect ones.
  • Figure 4: Inference time comparison. We compare the text-to-image retrieval inference time of our method CORA against two SOTA cross-attention methods, SGRAF (Diao et al., 2021) and NAAF (Zhang et al., 2022) (lower is better). Inference time is measured for different numbers of images in the database. With its dual-encoder architecture, CORA is much faster and more scalable than cross-attention approaches.
  • Figure 5: Successful image-to-text and image-to-entity retrieval on MS-COCO. In image-to-text retrieval, green denotes matching text according to the ground truth of MS-COCO, while red denotes incorrect matching. In image-to-entity retrieval, green and red denote correct and incorrect matching, respectively, as judged subjectively by us.
  • ...and 2 more figures
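
The two-stage text encoding described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: parameter-free dot-product attention pooling stands in for the learned Object-Attribute GAT and Object-Object GAT, a random lookup table (`ToyEmbedder`) stands in for the GRU or BERT concept encoder, and mean pooling stands in for the final caption aggregation. All class and function names are hypothetical, and active and passive roles are treated identically here for brevity.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneGraph:
    objects: list      # object node names, e.g. ["dog", "frisbee"]
    attributes: dict   # object index -> list of attribute words
    relations: list    # (subject index, predicate, object index) triples

class ToyEmbedder:
    """Stand-in for the semantic concept encoder: one random vector per word."""
    def __init__(self, dim=16, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def __call__(self, word):
        if word not in self.table:
            self.table[word] = self.rng.normal(size=self.dim)
        return self.table[word]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention pooling of keys (n, d) using a query (d,)."""
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ keys

def encode(graph, embed):
    # Stage 1 (object-attribute): each object attends over itself and its
    # attributes, producing one "entity" embedding per object.
    entities = []
    for i, name in enumerate(graph.objects):
        attrs = graph.attributes.get(i, [])
        nodes = np.stack([embed(name)] + [embed(a) for a in attrs])
        entities.append(attend(embed(name), nodes))
    entities = np.stack(entities)

    # Stage 2 (object-object): each entity attends over itself and over
    # messages from entities it is related to, in either role.
    contextual = []
    for i in range(len(entities)):
        messages = [entities[i]]
        for subj, pred, obj in graph.relations:
            if subj == i:      # entity i plays the active role
                messages.append(entities[obj] + embed(pred))
            elif obj == i:     # entity i plays the passive role
                messages.append(entities[subj] + embed(pred))
        contextual.append(attend(entities[i], np.stack(messages)))

    caption = np.stack(contextual).mean(axis=0)   # holistic caption embedding
    return entities, caption

# "A brown dog catching a red frisbee"
graph = SceneGraph(
    objects=["dog", "frisbee"],
    attributes={0: ["brown"], 1: ["red"]},
    relations=[(0, "catching", 1)],
)
entities, caption = encode(graph, ToyEmbedder(dim=16))
```

Returning both outputs mirrors the two alignment levels named in the abstract: the entity embeddings can be matched to the image locally, and the caption embedding holistically.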