HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

Kazuki Kawamura, Akihiro Yamamoto

TL;DR

The paper tackles extracting and integrating relational information from HTML tables across web pages with varying structures. It introduces HTML-LSTM, a dual-direction Tree-LSTM operating on DOM trees with node-level encoding of tag, text, and PoS tags, augmented by Bi-LSTM content encoding. The approach uses $L_{focal}$ and $L_{f1}$ losses and data augmentation, achieving $F_1$ scores of $0.96$ on preschool data and $0.86$ on syllabus data, and it outperforms a baseline Tree-LSTM. This enables robust cross-page table integration and can be extended to non-table HTML fragments, enhancing scalable information extraction from the web.
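The summary names $L_{focal}$ but does not give its form. As a rough illustration only, the sketch below implements the widely used focal loss for a single example, $-(1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the probability assigned to the true class. The function name `focal_loss`, the default `gamma=2.0`, and the assumption that the paper uses this standard form are not taken from the paper.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one example (illustrative, not the paper's code).

    probs  -- predicted class probabilities, assumed to sum to 1
    target -- index of the true class
    gamma  -- focusing parameter (assumed default); gamma=0 gives cross-entropy
    """
    p_t = probs[target]
    # Down-weight well-classified examples by (1 - p_t)^gamma.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the weighting term is 1 and the function reduces to ordinary cross-entropy; larger values shrink the contribution of confidently correct predictions, which is the usual motivation for this loss when classes are imbalanced.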

Abstract

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends tree-structured LSTM, a neural network for tree-structured data, to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
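To make the notion of "structural information of HTML data" concrete, here is a minimal sketch of turning an HTML table fragment into a tree of tag and text nodes, the kind of input a tree-structured model would consume. It uses only Python's standard `html.parser`. The names `Node`, `TreeBuilder`, `html_to_tree`, and `preorder` are hypothetical helpers, and the sample table is made up; none of this is the authors' preprocessing code.

```python
from html.parser import HTMLParser

class Node:
    """One element of the tree: a tag, its direct text, and its children."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.text = ""
        self.children = []
        self.parent = parent

class TreeBuilder(HTMLParser):
    """Builds a Node tree from start tags, end tags, and text."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")  # synthetic root above the document
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()

def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def preorder(node, depth=0):
    """Yield (depth, tag, text) for every node, parents before children."""
    yield depth, node.tag, node.text
    for child in node.children:
        yield from preorder(child, depth + 1)

if __name__ == "__main__":
    sample = "<table><tr><th>Name</th><td>Sakura Preschool</td></tr></table>"
    for depth, tag, text in preorder(html_to_tree(sample)):
        print("  " * depth + tag, text)
```

The sketch assumes reasonably well-formed markup; void elements such as `<br>` written without a closing slash would be treated as parents of the content that follows until an enclosing end tag is seen.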

Paper Structure (13 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The HTML-LSTM framework for information extraction and integration from HTML tables in web pages
  • Figure 2: The workflow of information extraction using HTML-LSTM
  • Figure 3: Example of converting HTML data to a tree structure
  • Figure 4: HTML-LSTM architecture
  • Figure 5: Example of information integration