Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv

TL;DR

The paper tackles the challenge of measuring visual-semantic similarity in image-text matching by addressing the unequal information density between modalities. It introduces AVSE, which uses Radial Bias Sampling to create multi-view image features and an AEOM module that decomposes embeddings into meta-semantic units for dynamic, dimension-aware matching with $S(I,T)=\sum_{j=1}^q \max_i A_{i,j}$ and final loss $\mathcal{L}=\mathcal{L}_m+\mathcal{L}_{reg}$, achieving $O(n)$ complexity. A dimension-wise regularization loss further aligns semantic channels across views, improving alignment between modalities. Empirically, AVSE attains state-of-the-art results on MS-COCO and Flickr30K across multiple backbones and runs faster than local-cross-attention methods, highlighting its practicality for large-scale vision-language retrieval.

Abstract

Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there exist inherent variations between vision and language data, such as information density, i.e., images can contain textual information from multiple different views, which makes it difficult to compute the similarity between these two modalities accurately and efficiently. In this paper, we propose a novel framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. To capture information from different views in the image, we design a radial bias sampling module to sample image patches and obtain image features from various views. Furthermore, AVSE introduces a novel module for efficient computation of visual semantic similarity between asymmetric image and text embeddings. Central to this module is the presumption of foundational semantic units within the embeddings, denoted as "meta-semantic embeddings." It segments all embeddings into meta-semantic embeddings with the same dimension and calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
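To make the matching step concrete, here is a minimal sketch of the similarity $S(I,T)=\sum_{j=1}^q \max_i A_{i,j}$, not the authors' implementation. It assumes that $A_{i,j}$ is the cosine similarity between the $i$-th image block and the $j$-th text block, that blocks are consecutive equal-length slices of a flat embedding, and that the image embedding holds more blocks than the text embedding because it covers several views. The function names and the `block_dim` parameter are hypothetical.

```python
import math

def split_blocks(vec, block_dim):
    """Split a flat embedding into consecutive blocks of length block_dim."""
    if len(vec) % block_dim != 0:
        raise ValueError("embedding length must be a multiple of block_dim")
    return [vec[k:k + block_dim] for k in range(0, len(vec), block_dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def asymmetric_similarity(image_vec, text_vec, block_dim):
    """Sum, over text blocks j, of the best-matching image block: sum_j max_i A[i][j]."""
    image_blocks = split_blocks(image_vec, block_dim)  # assumed: multi-view, more blocks
    text_blocks = split_blocks(text_vec, block_dim)    # q blocks
    return sum(
        max(cosine(img, txt) for img in image_blocks)
        for txt in text_blocks
    )

# Toy example: 4 image blocks and 2 text blocks, each of dimension 2.
image_vec = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, -1.0, 0.0]
text_vec = [2.0, 0.0, 0.0, 3.0]
score = asymmetric_similarity(image_vec, text_vec, block_dim=2)
print(score)
```

In this toy case each text block has an image block pointing in exactly the same direction, so each maximum is 1 and the score is 2. Because every text block picks its own best image block, different texts can select different parts of the same image embedding, and the computation needs only cosine similarities rather than cross-modal attention.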

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (Top) Information density varies between vision and language data, e.g., an image can be described from multiple different views using language. (Bottom) The conceptual diagram of our proposed method. We first feed differently sampled patches (Radial Bias Sampling) to the shared encoder (Vision Transformers) to get two groups of embeddings, and then compute dynamic similarity for different texts by selecting different parts of the image feature (Asymmetric Embedding Optimal Matching). (Best viewed in color.)
  • Figure 2: Difference between our proposed AVSE framework and previous methods. "Dynamic features" refers to hybrid features computed with cross-modal attention, which is a computationally expensive operation. "Dynamic matching" means that, when calculating similarity, each meta-semantic embedding of the sentence is used to find the most similar meta-semantic embedding in the image. The process is simple and requires only cosine similarity.
  • Figure 3: An overview of Asymmetric Visual Semantic Embedding. Asymmetric Feature Extraction extracts image features from different views to bridge the inherent differences in information density between images and texts. Asymmetric Embedding Optimal Matching attempts to learn meta-semantic embeddings of different modalities and calculates similarity through the optimal matching of meta-semantic embeddings between images and texts. Dimension-wise Regularization regularizes the embeddings of different image views to assist in learning meta-semantic embeddings.
  • Figure 4: Inference time for image-text retrieval on GPU (lower is better). Our AVSE method computes similarity at almost the same speed as VSE, and much faster than the local-level matching method, especially when the number of images grows large.
  • Figure 5: Visualization of the radial bias sampling strategy, which can effectively extract multi-view information.