Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples

Dae Ung Jo, Kyuewang Lee, JaeHo Chung, Jin Young Choi

TL;DR

This work tackles the high labeling cost of finely-categorized image-text retrieval (ITR) with an active learning (AL) framework in which the annotator writes paired texts for selected unpaired images. The core idea is to select unpaired images that are likely to produce large training losses for the current model, i.e. hard negatives, guided by a triplet-ranking loss and a hard-negative scoring mechanism with configurable thresholds. The method defines hard-negative conditions and an aggregation-based score to choose images, and it outperforms random and Core-set baselines on Flickr30K and MS-COCO, particularly in early AL stages. The approach enables cost-efficient construction of paired data for fine-grained ITR and offers a tunable trade-off between computation and performance through full/mini-batch and Top-$k$ variants.

Abstract

Securing a sufficient amount of paired data is important for training an image-text retrieval (ITR) model, but collecting paired data is very expensive. To address this issue, we propose an active learning (AL) algorithm for ITR that collects paired data cost-efficiently. Previous studies assume that image-text pairs are given and ask the annotator for their category labels. However, in recent ITR studies, category labels have become less important, since a retrieval model can be trained with image-text pairs alone. For this reason, we set up an active learning scenario where unpaired images (or texts) are given and the annotator provides the corresponding texts (or images) to make paired data. The key idea of the proposed AL algorithm is to select unpaired images (or texts) that can be hard negative samples for existing texts (or images). To this end, we introduce a novel scoring function to choose hard negative samples. We validate the effectiveness of the proposed method on the Flickr30K and MS-COCO datasets.

Paper Structure

This paper contains 32 sections, 6 equations, 4 figures, 8 tables, 2 algorithms.

Figures (4)

  • Figure 1: Example of an image selected by the proposed AL algorithm. (a) For an unpaired image $x_i$, (b) the algorithm calculates an aggregation weight $w_{ij}$ for each text $t_j\in T$ and $x_i$, according to the designed threshold for the hard negative condition. A score $h_i$ for $x_i$ is then obtained by summing $w_{ij}$ over $j$. (c) Corresponding images for the texts in (b).
  • Figure 2: Performance of the mini-batch version of the proposed algorithm as a function of the cardinality of $Z_s$.
  • Figure 3: Visualization of images selected by (a) the proposed algorithm and (b) random selection in the t-SNE embedding space.
  • Figure 4: Evaluation results on Flickr30K and MS-COCO. Each graph shows R@$1$ performance at each epoch of the AL scenario. The $x$-axis represents the ratio of paired data to the entire dataset and the $y$-axis denotes the R@$1$ performance.
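
The scoring step described in the Figure 1 caption (a weight $w_{ij}$ per text $t_j$, summed over $j$ to give a score $h_i$) can be illustrated with a minimal sketch. The summary above does not give the exact hard negative condition or weight function, so everything specific here is an assumption for illustration: cosine similarity on L2-normalized embeddings, a binary weight that fires when an unpaired image scores within a hypothetical `margin` of a text's own positive pair, and the function names `hard_negative_scores` and `select_top_k`. It is not the authors' implementation.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)


def hard_negative_scores(unpaired_img, paired_img, paired_txt, margin=0.2):
    """Score each unpaired image x_i by how many paired texts it is a hard negative for.

    ASSUMPTION (not from the paper): x_i counts as a hard negative for text t_j when
        s(x_i, t_j) > s(x_j, t_j) - margin,
    i.e. when a hinge-style triplet term for (t_j, x_j, x_i) would be positive.
    """
    x_u = l2_normalize(np.asarray(unpaired_img, dtype=np.float64))  # (U, d)
    x_p = l2_normalize(np.asarray(paired_img, dtype=np.float64))    # (P, d)
    t_p = l2_normalize(np.asarray(paired_txt, dtype=np.float64))    # (P, d)

    pos_sim = np.sum(x_p * t_p, axis=1)  # (P,)   similarity of each true pair
    neg_sim = x_u @ t_p.T                # (U, P) unpaired image vs. every text

    # Binary aggregation weight w_ij (assumed form), then h_i = sum_j w_ij.
    w = (neg_sim > pos_sim[None, :] - margin).astype(np.float64)
    return w.sum(axis=1)


def select_top_k(scores, k):
    """Return indices of the k highest-scoring unpaired images (ties keep input order)."""
    return np.argsort(-scores, kind="stable")[:k]
```

Under these assumptions, the images returned by `select_top_k` would be the ones sent to the annotator for paired texts in the next AL round. The threshold (`margin` here) controls how strict the hard negative condition is, which is one way to read the "configurable thresholds" mentioned in the TL;DR.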