Image-to-Markup Generation with Coarse-to-Fine Attention

Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, Alexander M. Rush

TL;DR

This paper tackles image-to-markup generation, focusing on converting rendered mathematical images into LaTeX-like markup. It replaces left-to-right OCR assumptions with a grid-based encoder and a flexible attention mechanism, introducing coarse-to-fine attention to cut computational cost. A new large dataset, Im2Latex-100k, supports evaluation and shows that attention-based models outperform traditional OCR baselines on rendered data and, with pretraining, can transfer to handwritten data. The work also shows how hierarchical and coarse-to-fine attention variants trade off accuracy and efficiency, which is relevant to scalable, data-driven OCR of structured text.

Abstract

We present a neural encoder-decoder model to convert images into presentational markup based on a scalable coarse-to-fine attention mechanism. Our method is evaluated in the context of image-to-LaTeX generation, and we introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup. We show that unlike neural OCR techniques using CTC-based models, attention-based approaches can tackle this non-standard OCR task. Our approach outperforms classical mathematical OCR systems by a large margin on in-domain rendered data, and, with pretraining, also performs well on out-of-domain handwritten data. To reduce the inference complexity associated with the attention-based approaches, we introduce a new coarse-to-fine attention layer that selects a support region before applying attention.
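
The last sentence of the abstract can be illustrated with a minimal NumPy sketch: a coarse grid is scored first, one cell is selected as the support region, and ordinary soft attention is then applied only to the fine features inside that cell. The function name `coarse_to_fine_attention`, the dot-product scoring, and the hard argmax selection are assumptions made for illustration and are not taken from the paper; the $4\times 4$ support size follows the Figure 1 caption.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def coarse_to_fine_attention(fine, coarse, query, cell=4):
    """Illustrative coarse-to-fine attention (hypothetical helper).

    fine:   (H, W, D) fine-grained grid features
    coarse: (H // cell, W // cell, D) coarse grid features
    query:  (D,) decoder state used to score positions
    """
    Hc, Wc, D = coarse.shape
    # Coarse step: score every coarse cell (dot product is an assumption)
    # and keep the single best cell as the support region.
    coarse_scores = coarse.reshape(Hc * Wc, D) @ query
    r, c = divmod(int(np.argmax(coarse_scores)), Wc)
    # Fine step: soft attention restricted to the cell x cell block
    # of fine features that lies under the selected coarse cell.
    block = fine[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell].reshape(-1, D)
    weights = softmax(block @ query)
    context = weights @ block
    return (r, c), weights.reshape(cell, cell), context
```

In this sketch each decoding step scores `Hc * Wc` coarse cells plus `cell * cell` fine positions, rather than all `H * W` fine positions, which is where the reduction in attention cost comes from.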

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example of the model generating mathematical markup. The model generates one LaTeX symbol $y$ at a time based on the input image $\mathbf{x}$. The gray lines highlight the $H\times V$ grid features $\mathbf{V}$ formed by the row encoder from the CNN's output. The dotted lines indicate the center of mass of $\boldsymbol{\alpha}$ for each token (only non-structural tokens are shown). The blue cell indicates the support set selected by the coarse-level attention for the symbol "0", while the red cells indicate its fine-level attention. White space around the image has been trimmed for visualization. The actual size of the blue mask is $4\times 4$. See http://lstm.seas.harvard.edu/latex/ for a complete interactive version of this visualization over the test set.
  • Figure 2: Network structure. Given an input image, a CNN is applied to extract a feature map $\mathbf{\tilde{V}}$. For each row in the feature map, we employ an RNN to encode spatial layout information. The encoded fine features $\mathbf{V}$ are then used by an RNN decoder with a visual attention mechanism to produce final outputs. For clarity we only show the RNN encoding at the first row and the decoding at one step. In Section 4, we consider variants of the model where another CNN and row encoder are applied to the feature map to extract coarse features $\mathbf{V}'$, which are used to select a support region in the fine-grained features, as indicated by the blue masks.
  • Figure 3: An example synthetic handwritten image from the Im2Latex-100k dataset.
  • Figure 4: Test accuracy (Match) of the model w.r.t. training set size.
  • Figure 5: Typical reconstruction errors on aligned images. Red denotes the gold image and blue denotes the generated image.