A Fast Hierarchical Method for Multi-script and Arbitrary Oriented Scene Text Extraction

Lluis Gomez, Dimosthenis Karatzas

TL;DR

This paper addresses the problem of text segmentation in natural scenes from a hierarchical perspective, introducing a feature space designed to produce text group hypotheses with high recall and a novel stopping rule that combines a discriminative classifier with a probabilistic measure of group meaningfulness based on perceptual organization.

Abstract

Typography and layout lead to the hierarchical organisation of text in words, text lines, and paragraphs. This inherent structure is a key property of text in any script and language, which has nonetheless been minimally leveraged by existing text detection methods. This paper addresses the problem of text segmentation in natural scenes from a hierarchical perspective. Contrary to existing methods, we make explicit use of text structure, aiming directly at the detection of region groupings corresponding to text within a hierarchy produced by an agglomerative similarity clustering process over individual regions. We propose an optimal way to construct such a hierarchy, introducing a feature space designed to produce text group hypotheses with high recall and a novel stopping rule combining a discriminative classifier and a probabilistic measure of group meaningfulness based on perceptual organization. Results obtained over four standard datasets, covering text in variable orientations and different languages, demonstrate that our algorithm, while being trained on a single mixed dataset, outperforms state-of-the-art methods in unconstrained scenarios.
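
To make the idea of a hierarchy of text group hypotheses concrete, the following is a minimal sketch of agglomerative clustering over individual regions, in which every merge creates one node of the dendrogram. It is not the authors' implementation: the 2-D toy features, the Euclidean distance, the single-linkage criterion, and the names `single_linkage_hierarchy` and `pure_text_groups` are all assumptions made for illustration. The stopping rule described above (discriminative classifier plus meaningfulness measure) is not implemented; ground-truth labels stand in for it, only to show which nodes contain text regions exclusively.

```python
from itertools import combinations
import math


def single_linkage_hierarchy(points):
    """Agglomerate points bottom-up; return one (members, distance) node per merge."""
    # Start with one singleton cluster per region.
    clusters = [frozenset([i]) for i in range(len(points))]
    nodes = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage (an assumption): distance of the closest member pair.
            d = min(math.dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merged = a | b
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append(merged)
        nodes.append((merged, d))
    return nodes


def pure_text_groups(nodes, is_text):
    """Keep the nodes whose members are all labelled as text regions."""
    return [members for members, _ in nodes if all(is_text[i] for i in members)]


# Hypothetical toy data: two "words" (regions 0-2 and 3-4) and one
# isolated non-text region (5), described by made-up 2-D features.
regions = [(0, 0), (1, 0), (2.5, 0), (10, 10), (11, 10), (30, 0)]
is_text = [True, True, True, True, True, False]

nodes = single_linkage_hierarchy(regions)
pure = pure_text_groups(nodes, is_text)
for members in pure:
    print(sorted(members))
```

In this toy hierarchy the root node is not a pure text group because it absorbs the non-text region, which illustrates why some rule is needed to decide which nodes of the dendrogram to accept as text.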

Paper Structure

This paper contains 14 sections, 7 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: A natural scene image and a hierarchical representation of its text. Atomic objects (characters) extracted in the bottom layer are agglomerated into text groupings at different levels of the hierarchy.
  • Figure 2: Individual text-parts are less distinguishable when viewed separately, but become structurally relevant and easily identifiable when perceived as a group.
  • Figure 3: A bottom-up agglomerative clustering of individual regions produces a dendrogram in which each node represents a text group hypothesis. Our work focuses on learning the optimal features allowing the generation of pure text groups (comprising only text regions) with high recall, and designing a stopping rule that allows the efficient detection of those groups in a single grouping step.
  • Figure 4: There is no single best feature for character clustering: characters in the same word may appear with different colors (a), stroke widths (b), or sizes (c).
  • Figure 5: (a) Scene image, (b) its MSER decomposition, and (c,d) two possible hierarchies built from two different weight configurations; red nodes indicate pure text groupings. The first configuration (c) yields a 28% text group recall ($\mathcal{TGR}$), while the second (d) achieves 100% for this particular image.
  • ...and 9 more figures