Enhancing Image-Text Matching with Adaptive Feature Aggregation

Zuhui Wang; Yunting Yin; I. V. Ramakrishnan

TL;DR

This work addresses image-text matching, where imbalanced single-modal representations hinder cross-modal retrieval. It introduces a Feature Enhancement Module that adaptively aggregates features within each modality to produce balanced representations. The module is paired with a new loss that uses harder negative samples, generated via mixup with a mixing coefficient drawn from a Beta distribution, to improve discriminative learning. The method combines region-based visual features with textual encodings, a six-step feature aggregation process, dimension-wise feature selection, and a four-term triplet-like loss based on cosine similarity. It achieves superior recall on Flickr30K and MS-COCO compared with several state-of-the-art models. The results underscore the value of targeted feature augmentation and harder-negative supervision for robust cross-modal retrieval, with potential extensions to video-text retrieval.
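The idea of building harder negatives through mixup can be illustrated with a minimal sketch. Everything in it is an assumption for illustration and not the authors' implementation: the function names (`mixup_negative`, `triplet_hinge`, `loss_with_mixed_negative`), the margin of 0.2, the symmetric Beta(alpha, alpha) parameterization, and the two-term loss are hypothetical. The paper's loss has four terms whose exact form is not given in this summary.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon avoids division by zero


def mixup_negative(positive, negative, rng, alpha=1.0):
    """Interpolate a negative toward the positive.

    Assumption: the mixing coefficient is drawn from Beta(alpha, alpha).
    The result usually lies closer to the positive than the raw negative,
    which tends to make it harder to separate from the anchor.
    """
    lam = rng.betavariate(alpha, alpha)
    return [lam * p + (1.0 - lam) * n for p, n in zip(positive, negative)]


def triplet_hinge(anchor, positive, negative, margin=0.2):
    """Standard hinge-based triplet ranking term on cosine similarity."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def loss_with_mixed_negative(anchor, positive, negative, rng, margin=0.2, alpha=1.0):
    """Hypothetical two-term loss: raw negative plus mixup-generated negative."""
    mixed = mixup_negative(positive, negative, rng, alpha)
    return (triplet_hinge(anchor, positive, negative, margin)
            + triplet_hinge(anchor, positive, mixed, margin))


if __name__ == "__main__":
    rng = random.Random(0)
    image = [1.0, 0.0]       # toy image embedding (anchor)
    caption_pos = [1.0, 0.0]  # matching caption embedding
    caption_neg = [0.0, 1.0]  # non-matching caption embedding
    print(loss_with_mixed_negative(image, caption_pos, caption_neg, rng))
```

In this toy setting the raw negative is orthogonal to the image, so its triplet term is zero, while the mixed negative can still contribute a positive loss. That is the sense in which interpolated negatives supply extra supervision once easy negatives stop producing gradients.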

Abstract

Image-text matching aims to accurately retrieve matching cross-modal pairs. While current methods often rely on projecting cross-modal features into a common embedding space, they frequently suffer from imbalanced feature representations across modalities, leading to unreliable retrieval results. To address these limitations, we introduce a novel Feature Enhancement Module that adaptively aggregates single-modal features for more balanced and robust image-text retrieval. Additionally, we propose a new loss function that overcomes the shortcomings of the original triplet ranking loss, thereby significantly improving retrieval performance. The proposed model has been evaluated on two public datasets and achieves competitive retrieval performance compared with several state-of-the-art models. The implementation code is publicly available.
