Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation

Shreyas Vaidya, Arvind Kumar Sharma, Prajwal Gatti, Anand Mishra

TL;DR

This work studies visual translation as a standalone problem for the first time in the literature and presents a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task.

Abstract

In this work, we study the task of "visually" translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of a translated image that preserves the visual features of the source scene text, such as font, size, and background. The task poses several challenges: translating with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements that yield a variant of the baseline with improved performance. (iv) The existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate the presented baselines on translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses the challenges posed by visual translation. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
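To make the cascaded design in contribution (ii) concrete, the following is a minimal sketch of how three independent stages could be chained. It is not the authors' implementation: the class name `VisualTranslator`, the three callables, and the dictionary-based "image crop" are all hypothetical stand-ins chosen for illustration, and real recognition, translation, and synthesis models would replace the toy lambdas.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VisualTranslator:
    """Hypothetical cascade: recognize -> translate -> synthesize."""

    # Scene text recognition: image crop -> source-language string.
    recognize: Callable[[Any], str]
    # Machine translation: source-language string -> target-language string.
    translate: Callable[[str], str]
    # Scene text synthesis: (image crop, target string) -> new image crop
    # that should keep the crop's visual style.
    synthesize: Callable[[Any, str], Any]

    def __call__(self, crop: Any) -> Any:
        source_text = self.recognize(crop)
        target_text = self.translate(source_text)
        return self.synthesize(crop, target_text)


# Toy stand-ins (assumptions for illustration only): a "crop" is a dict
# holding its text and a style tag, and translation is a lexicon lookup
# that passes unknown words through unchanged.
toy_lexicon = {"पानी": "water"}

pipeline = VisualTranslator(
    recognize=lambda crop: crop["text"],
    translate=lambda text: toy_lexicon.get(text, text),
    synthesize=lambda crop, text: {**crop, "text": text},
)

result = pipeline({"text": "पानी", "style": "blue-signboard"})
print(result)
```

Because each stage is passed in as a plain callable, any one of them can be swapped without touching the others, which is the practical appeal of a cascade. The cost is that errors made by an early stage propagate to every later one.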

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Imagine visiting Delhi, India, and arriving at the Rithala (Hindi: रिठाला) metro station. If you are not familiar with Hindi, the signboard on the left might be incomprehensible. The result of our proposed baseline solution, shown on the right, seamlessly transliterates the station name रिठाला to English. We aim to visually translate (or transliterate, when necessary, as in this case) text from the source language to the target language while preserving the visual attributes of the source scene text. Specifically, we focus on visual translation between Hindi and English in this work.
  • Figure 2: Outline of the proposed cascaded baseline for visual translation. We use state-of-the-art approaches for scene text recognition, machine translation, and scene text synthesis to design variants of our baseline. We further investigate scene text synthesis and propose an extension to the existing SRNet architecture.
  • Figure 3: Our proposed baseline extends the SRNet scene text synthesis approach by decoupling background and foreground generation. More details are provided in Section \ref{sec:method}.
  • Figure 4: Examples from the VT-Syn dataset, which contains paired Eng $\rightarrow$ Hin and Hin $\rightarrow$ Eng images with diverse fonts, text colors, sizes, orientations, and backgrounds (natural scenes, textures, and plain colors).
  • Figure 5: A few examples from the VT-Real dataset, showing images with their Eng-Hin and Hin-Eng ground-truth translations, manually annotated by three independent annotators (referred to as users here).
  • ...and 2 more figures