Block-level Text Spotting with LLMs

Ganesh Bannur, Bharadwaj Amrutur

TL;DR

This work addresses block-level text spotting in natural images, a relatively underexplored granularity that benefits downstream tasks such as translation. It introduces BTS-LLM, a pipelined system that first detects and recognizes lines, then groups them into blocks, and finally uses a large language model to determine the semantically meaningful reading order within each block and to reconstruct the block text when recognition errors occur. The approach uses a Unified Detector for line detection, PARSeq for line recognition, and GPT-3.5 Turbo for ordering, with a carefully designed prompt and safeguards that include coordinate handling and length checks. Evaluated on a HierText-derived block-level dataset, BTS-LLM achieves competitive scores on semantic and string similarity measures, demonstrating the potential of LLMs to enhance block-level text spotting and downstream applications such as translation and scene-text reasoning.

Abstract

Text spotting has seen tremendous progress in recent years, yielding performant techniques that can extract text at the character, word or line level. However, extracting blocks of text from images (block-level text spotting) is relatively unexplored. Blocks contain more context than individual lines, words or characters, so block-level text spotting would enhance downstream applications, such as translation, which benefit from added context. We propose a novel method, BTS-LLM (Block-level Text Spotting with LLMs), to identify text at the block level. BTS-LLM has three parts: 1) detecting and recognizing text at the line level, 2) grouping lines into blocks and 3) finding the best order of lines within a block using a large language model (LLM). We aim to exploit the strong semantic knowledge in LLMs for accurate block-level text spotting. Consequently, if the spotted text is semantically meaningful but has been corrupted during text recognition, the LLM is also able to rectify mistakes in the text and produce a reconstruction of it.
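The three parts listed in the abstract can be pictured as a simple data flow. The sketch below is illustrative only and is not the authors' implementation: the `Line` record, the `(x_min, y_min, x_max, y_max)` box format, and the function names are all assumptions, and the detector, recognizer, and LLM are replaced by plain Python stand-ins. The ordering step is a pluggable function so that an LLM-based orderer could be substituted for the purely geometric default.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Line:
    """One detected and recognized line (hypothetical record type)."""
    bbox: BBox
    text: str
    block_id: int  # group label assumed to come from the detection+grouping stage


def group_into_blocks(lines: List[Line]) -> Dict[int, List[Line]]:
    """Part 2: collect lines that share a block label."""
    blocks: Dict[int, List[Line]] = {}
    for line in lines:
        blocks.setdefault(line.block_id, []).append(line)
    return blocks


def geometric_order(lines: List[Line]) -> str:
    """Stand-in for part 3: order top-to-bottom, then left-to-right.

    BTS-LLM uses an LLM here; this geometric rule is only a placeholder.
    """
    ordered = sorted(lines, key=lambda l: (l.bbox[1], l.bbox[0]))
    return " ".join(l.text for l in ordered)


def spot_blocks(
    lines: List[Line],
    order_fn: Callable[[List[Line]], str] = geometric_order,
) -> List[str]:
    """Run grouping and ordering on already-recognized lines (part 1 output)."""
    blocks = group_into_blocks(lines)
    return [order_fn(blocks[block_id]) for block_id in sorted(blocks)]


if __name__ == "__main__":
    demo = [
        Line((0, 20, 50, 30), "world", 0),
        Line((0, 0, 50, 10), "hello", 0),
        Line((100, 0, 150, 10), "exit", 1),
    ]
    print(spot_blocks(demo))
```

Passing a different `order_fn` is where a semantic ordering step would plug in; the geometric default is exactly the kind of rule that can fail when layout alone is ambiguous.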

Paper Structure (21 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: The four levels of text spotting
  • Figure 2: A set of lines identified as being part of the same block is highlighted in the image. The first part of the pipeline identifies the bounding box and the corresponding text of each line. Based on their bounding boxes, there are two ways to order these lines, which are shown as the two possible orderings of the array. The correct order is the one on top; however, if we were to simply read left-to-right in the image, we would pick the bottom order. To know which one to pick, it becomes important to read the text and decide which order makes more sense.
  • Figure 3: The pipeline of BTS-LLM is as follows. An input image is given to the Text Detection+Grouping Model. This model performs line-level text detection and finds the bounding boxes of all the lines in the image. It then groups lines in close proximity into blocks. In the figure, the detected lines are numbered from 1 to 8 and the grouping is shown as an array. The detected regions are given to the Text Recognition Model, which recognizes the line of text in each region. Finally, for every block, the recognized texts and the bounding boxes of the lines are given to the LLM, which outputs the text for the block.
  • Figure 4: We assume that blocks fall into one of two categories
  • Figure 5: The prompt provided to the LLM. BTS-LLM uses GPT-3.5 Turbo, which requires a system prompt. For other LLMs that do not have a system prompt, it can be prepended to the user prompt. The system prompt given was empirically found to be the best for two reasons: 1) It suppresses extraneous output. 2) It forces GPT-3.5 Turbo to always give an answer (the best possible one, given the input). Finally, leaving the system prompt blank was seen to severely decrease the quality of outputs. GPT-3.5 Turbo is accessed via the API.
  • ...and 2 more figures
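The captions for Figures 2 and 5 describe passing each block's line texts and bounding boxes to an LLM under a system prompt, and the TL;DR mentions length checks as a safeguard. A minimal sketch of that step follows. Everything specific here is an assumption made for illustration: the prompt wording, the `build_user_prompt` and `order_block` names, the `max_len_ratio` threshold, and the geometric fallback are hypothetical and are not taken from the paper. The LLM is passed in as a plain callable so the sketch makes no claims about any particular API.

```python
from typing import Callable, List, Tuple

# Assumed line format: ((x_min, y_min, x_max, y_max), recognized_text).
LineItem = Tuple[Tuple[int, int, int, int], str]

# Hypothetical system prompt; the paper's actual wording is shown in its Figure 5.
SYSTEM_PROMPT = "Reply with the block text only. Always give your best answer."


def build_user_prompt(lines: List[LineItem]) -> str:
    """Serialize each line's box and text into one prompt row (assumed format)."""
    rows = [
        f"{i + 1}. box=({x0}, {y0}, {x1}, {y1}) text={text!r}"
        for i, ((x0, y0, x1, y1), text) in enumerate(lines)
    ]
    return "Order these lines into one block of text:\n" + "\n".join(rows)


def geometric_fallback(lines: List[LineItem]) -> str:
    """Top-to-bottom, then left-to-right reading order."""
    ordered = sorted(lines, key=lambda item: (item[0][1], item[0][0]))
    return " ".join(text for _, text in ordered)


def order_block(
    lines: List[LineItem],
    llm: Callable[[str, str], str],  # (system_prompt, user_prompt) -> reply
    max_len_ratio: float = 1.5,      # assumed threshold for the length check
) -> str:
    """Ask the LLM for the block text; fall back if the reply length is implausible."""
    fallback = geometric_fallback(lines)
    reply = llm(SYSTEM_PROMPT, build_user_prompt(lines)).strip()
    too_long = len(reply) > max_len_ratio * len(fallback)
    too_short = len(reply) < len(fallback) / max_len_ratio
    if not reply or too_long or too_short:
        return fallback
    return reply


if __name__ == "__main__":
    block = [((0, 20, 50, 30), "world"), ((0, 0, 50, 10), "hello")]
    # A fake LLM stands in for a real model call.
    print(order_block(block, lambda system, user: "hello world"))
```

The length check rejects replies that are much longer or shorter than the concatenated input lines, which guards against extraneous commentary or dropped text while still leaving room for small corrections to misrecognized characters.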