Cross-Modal Entity Matching for Visually Rich Documents

Ritesh Sarkhel, Arnab Nandi

TL;DR

The paper tackles the problem of incomplete information in visually rich documents (VRDs) by introducing Juno, a cross-modal entity matching framework that maps text spans from VRDs to semantically similar relational tuples in external databases within a shared multimodal embedding space. It combines a dual-phase encoding of text spans and tuples with an asymmetric bi-directional attention mechanism to prune candidate matches and enable scalable, on-device deployment. Empirical evaluations on two real-world datasets show that Juno outperforms baselines by over 6 F1 points while requiring up to 60% fewer labeled samples, and that it is robust to resource constraints. This work advances VRD information extraction by addressing incompleteness in an end-to-end, generalizable, and computationally efficient manner, suitable for edge devices and diverse document layouts.

Abstract

Visually rich documents (e.g. leaflets, banners, magazine articles) are physical or digital documents that utilize visual cues to augment their semantics. Information contained in these documents is ad-hoc and often incomplete. Existing works that enable structured querying on these documents do not take this into account. This makes it difficult to contextualize the information retrieved from querying these documents and gather actionable insights from them. We propose Juno, a cross-modal entity matching framework, to address this limitation. It augments heterogeneous documents with supplementary information by matching a text span in the document with semantically similar tuples from an external database. Our main contribution is a deep neural network with attention that goes beyond traditional keyword-based matching and finds matching tuples by aligning text spans and relational tuples on a multimodal encoding space without any prior knowledge about the document type or the underlying schema. Exhaustive experiments on multiple real-world datasets show that Juno generalizes to heterogeneous documents with diverse layouts and formats. It outperforms state-of-the-art baselines by more than 6 F1 points with up to 60% fewer human-labeled samples. Our experiments further show that Juno is a computationally robust framework. We can train it only once, and then adapt it dynamically for multiple resource-constrained environments without sacrificing its downstream performance. This makes it suitable for on-device deployment on various edge devices. To the best of our knowledge, ours is the first work that investigates the information incompleteness of visually rich documents and proposes a generalizable, performant and computationally robust framework to address it in an end-to-end way.
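
To make the idea of matching in a shared encoding space concrete, the following is a minimal sketch, not Juno's actual model. It assumes that a text span and each relational tuple have already been encoded as fixed-length vectors, and it ranks tuples by cosine similarity to the span. The function names (`cosine`, `top_k_matches`), the tuple identifiers, and the three-dimensional vectors are all hypothetical and hand-written for illustration; in the paper the encodings come from a trained neural network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def top_k_matches(span_vec, tuple_vecs, k=2):
    """Return the k tuples closest to the span as (tuple_id, score) pairs."""
    scored = [(tid, cosine(span_vec, vec)) for tid, vec in tuple_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings (assumption): stand-ins for learned encodings.
span_vec = [0.9, 0.1, 0.3]
tuple_vecs = {
    "movie_1": [0.8, 0.2, 0.4],
    "movie_2": [0.1, 0.9, 0.0],
    "movie_3": [0.5, 0.5, 0.5],
}

matches = top_k_matches(span_vec, tuple_vecs, k=2)
for tuple_id, score in matches:
    print(tuple_id, round(score, 3))
```

With these toy vectors, `movie_1` ranks first and `movie_3` second. Nearest-neighbour ranking of this kind only illustrates the retrieval step; it says nothing about how the encoder is trained or how candidates are pruned.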

Paper Structure (18 sections, 4 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Visually Rich Documents utilize visual and textual cues to highlight the semantics of various entities appearing on them. They can have diverse layouts, formats, and be used for short-form communication such as leaflets, posters and menu-cards.
  • Figure 2: An overview of Juno's end-to-end workflow is shown on the right side of this figure. It takes a text span from a visually rich document as input, encodes it as a fixed-length vector on a multimodal embedding space, and aligns it with semantically similar tuples in a relational database on that space. These tuples are then retrieved and returned to the user.
  • Figure 3: An overview of our neural network architecture. In this example, the network maps a visually rich movie poster (A) to tuples in a relational table (B) containing supplementary information about the movie.
  • Figure 4: Visualization of bi-directional attention computed over a text span in a movie poster and a relational tuple containing metadata about the movie. Darker shades indicate higher attention scores assigned by our network, and hence a higher likelihood of finding a match; lighter shades indicate lower attention scores and a lower likelihood of finding a match. In this example, we observe higher probabilities of a match between the text spans "Clint Eastwood" and "Coogan's Bluff" and the attributes Actor, Director and Title of a relational tuple in the database.
  • Figure 5: Sample documents from IMDB Movie Dataset (upper row) and NYC Open Event Dataset (bottom row)
  • ...and 1 more figure
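
The attention heat map described in the Figure 4 caption can be approximated with a generic bi-directional attention computation. The sketch below is an illustration under stated assumptions, not the asymmetric mechanism used in the paper, whose details are not given in this section. It assumes that each text span and each tuple attribute is already a vector, scores every span-attribute pair with a dot product, and normalizes the scores with a softmax in each direction. All names and the two-dimensional vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bidirectional_attention(span_vecs, attr_vecs):
    """Return (span_to_attr, attr_to_span) attention weights.

    span_to_attr[i][j]: weight that span i places on attribute j.
    attr_to_span[j][i]: weight that attribute j places on span i.
    """
    scores = [[dot(s, a) for a in attr_vecs] for s in span_vecs]
    span_to_attr = [softmax(row) for row in scores]
    attr_to_span = [
        softmax([scores[i][j] for i in range(len(span_vecs))])
        for j in range(len(attr_vecs))
    ]
    return span_to_attr, attr_to_span


# Hypothetical vectors (assumption): hand-picked so that the first span
# aligns with the first attribute and the second span with the second.
spans = ["Clint Eastwood", "Coogan's Bluff"]
attrs = ["Actor", "Title"]
span_vecs = [[2.0, 0.0], [0.0, 2.0]]
attr_vecs = [[1.0, 0.0], [0.0, 1.0]]

span_to_attr, attr_to_span = bidirectional_attention(span_vecs, attr_vecs)
for name, weights in zip(spans, span_to_attr):
    print(name, [round(w, 3) for w in weights])
```

Each row of `span_to_attr` sums to one, so a larger weight plays the role of a darker cell in the heat map. A real model would learn the vectors and the scoring function rather than use a fixed dot product.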