Partial Scene Text Retrieval

Hao Wang, Minghui Liao, Zhouyi Xie, Wenyu Liu, Xiang Bai

TL;DR

We propose a network that simultaneously retrieves both text-line instances and their partial patches, together with a Dynamic Partial Match Algorithm (DPMA) that directly searches for the target partial patch within a text-line instance at inference time, without requiring bags. This greatly improves both search efficiency and the performance of retrieving partial patches.

Abstract

The task of partial scene text retrieval involves localizing and searching for text instances that are the same as or similar to a given query text from an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our method adopts Multiple Instance Learning (MIL) to learn their similarities with the query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy training samples and slow down inference. To mitigate this, we propose a Ranking MIL (RankMIL) approach that adaptively filters those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch from a text-line instance during the inference stage, without requiring bags. This greatly improves the search efficiency and the performance of retrieving partial patches. The source code and dataset are available at https://github.com/lanfeng4659/PSTR.
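To make the shared-feature-space idea concrete, here is a minimal sketch of cross-modal ranking by cosine similarity, plus the max-pooling step that conventional MIL applies to a bag of patches. It is an illustration only and is not taken from the paper's code: the function names (`cosine`, `rank_instances`, `bag_score`) and the toy 2-D embeddings are assumptions, and the sketch shows conventional bag-based MIL scoring, not RankMIL or DPMA.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_instances(query_vec, instance_vecs):
    """Return instance indices sorted by descending similarity to the query."""
    scores = [cosine(query_vec, v) for v in instance_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def bag_score(query_vec, bag):
    """Conventional MIL pooling (assumed here): a bag scores as its best patch."""
    return max(cosine(query_vec, patch) for patch in bag)

# Hypothetical embeddings: one query text and three text-line instances.
query = [1.0, 0.0]
instances = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_instances(query, instances)

# A hypothetical bag of two partial-patch embeddings cut from one text line.
best_patch_similarity = bag_score(query, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the second instance ranks first because its embedding points almost in the same direction as the query. Scoring a bag by its best patch is what makes bag construction costly at inference, since every candidate patch must be embedded and compared.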

Paper Structure

This paper contains 30 sections, 11 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Result examples of retrieving scene text instances. The target text instances cover various types of instances, such as text-line instances (a), continuous partial patches (b), and non-continuous partial patches (c). The word in blue is the English translation corresponding to the Chinese word in black.
  • Figure 2: The training phase of our proposed framework. Given an image, text-line proposals are detected, and a bag is constructed within text-line instances. The features of text-line proposals are extracted and fed to cross-modal similarity learning for the TIR task. Meanwhile, features of instances within the bag are extracted and fed to ranking multiple instance learning for the PPR task. To simplify the visualization, we only show the training process of one text-line instance.
  • Figure 3: The representation of boundary points (points in green or red) that are regressed from the reference point (in white).
  • Figure 4: An example of constructing a bag (c) from the labeled text-line instance.
  • Figure 5: Comparison of the optimization targets of conventional MIL (a) and RankMIL (b). The red point is a query text $q_j$. The red arrow and yellow arrow indicate training on and abandoning a sample, respectively. In (b), from the outermost to the innermost, the similarity values at the three dashed circles are 0, the similarity between $q_j$ and $p^l_i$, and the similarity with a margin $m$ over the similarity between $q_j$ and $p^l$.
  • ...and 6 more figures